Need a watchdog for improved system fault tolerance?
Suhas Chakravarty, Rohit Tomar, Mohit Arora, Freescale Semiconductor
10/22/2008 11:41 PM EDT
The need for fault tolerant systems
Electronic control units (ECUs) are fast becoming ubiquitous. Among other areas, they are increasingly finding their way into safety-critical and mission-critical applications, such as automobile safety systems, aircraft fly-by-wire controls and spacecraft thrust controls. These control systems are expected to work reliably under all environmental conditions. Yet the software running on the ECU does experience faults in the real environment, and these can lead to a partial or total system crash. It is therefore of the utmost importance that the system display a high degree of fault tolerance, so that if and when faults such as software crashes happen, the system can recover quickly and return to a safe state.
A good example of a mission-critical and safety-critical application is the thrust control of spacecraft. One of the most delicate operations carried out in outer space is the docking of two spacecraft. Precise direction control and maneuvering are required to line up the two bodies properly so that they can dock. The system controlling the spacecraft's thrusters must work flawlessly. A software crash in the thrusters' ECU could result in the thrusters firing for too long, or at the wrong angle, or both, and the result would be a collision instead of a docking. A safety mechanism must be in place that can detect faults and put the ECU into a safe state before the thrusters start firing unpredictably.
Another critical application is the use of a robotic arm in surgery, which is becoming commonplace in advanced medical facilities. These systems can enhance the ability of physicians to perform complex procedures with minimal intervention. During an operation, the physician initiates a particular procedure, say a fine incision in a vital organ, and control then passes completely to the robotic arm wielding the scalpel. If a software failure occurs while the robot is at work, the robotic arm could behave unpredictably, posing a risk to the patient. If the system can recover quickly from such crashes, the robotic operation can halt and the physician can take appropriate further action. The operating room of the future is envisioned as a fully automated cell, in which surgery would be carried out by robotic arms under remote supervision from anywhere around the globe. Fault tolerance then becomes even more critical, owing to the increased dependency on the system.
The above examples serve to highlight the need for fault-tolerant systems. Looking ahead, it is not just automotive, industrial, aeronautical, medical and space applications that need fault tolerance. With the introduction of the IEC 60730 standards, even automatic electronic controls in household appliances are required to ensure safe and reliable operation of their components.
Reasons for System Failure
When deployed in any application, embedded systems experience two kinds of failures: hard errors and soft errors. Hard errors signify irreversible damage to the system, for example permanent damage to the chip package due to excessive vibration in a machine, or internal transistor breakdown at extreme temperatures. Soft errors, on the other hand, are errors from which the system can recover. They generally involve some form of data corruption in the system. Causes range from cosmic ray exposure, EMI and noisy power supplies to faulty coding. Cosmic rays and other kinds of high-frequency radiation are conditions commonly faced by spacecraft and by controls in hospital X-ray units. The robotic arm in the surgery unit is a pertinent example, as it can be exposed to stray radiation from X-ray units.
With increasing system frequencies, on-chip high-speed serial interfaces and decreasing pitch of chip package pins, EMI is an all too familiar enemy. Power supplies to the chip can be held hostage to transients at power-down and can droop due to ground bounce or current surges. Cosmic rays can cause bit flips in memory bit cells, while EMI and noisy power supplies can result in incorrect data being read from or written to memories and registers.
When such data corruption happens, program execution can be affected because the program counter may have been modified. Modification of the program code memory, or a read of wrong data from code memory, can result in a totally different and unintended instruction being executed. Either the program flow or the program code itself is thus altered. This is known as code runaway, and the system can enter an unknown state where its behaviour is unpredictable. Such runaway can also be the result of faulty coding on the part of the firmware developer. There might be unhandled exceptions, out-of-bounds array accesses, unbounded loops or simply an unexpected sequence of user inputs, all of which can lead to an unexpected outcome.
Once the program flow takes an unexpected branch, the system can start behaving unpredictably, which is unacceptable for a safety-critical system. For example, an airbag control unit could go haywire, firing at the wrong time or, worse, not firing during an accident. While remedial measures are available to prevent data corruption, there is still a need for a system monitor that can detect such system failures and take action to bring the system into a safe, known state. The system monitor would, in essence, act as the last "dive-and-catch" for the system when a code runaway takes place. It should be able to reliably detect a code runaway and then bring the system into a safe state with minimum delay. It should also itself be immune to code runaways.
A System Monitor--The Watchdog Timer
For quite some time now, the role of a system monitor in embedded systems has been fulfilled by a simple piece of logic called the watchdog. It is known by different names: COP (Computer Operating Properly), Watchdog Timer or simply Watchdog. It is essentially a timer running off a continuous clock. It expects to receive some sort of "All's well" signal from the system at regular intervals. This signaling is termed "refreshing the watchdog," and can take varied forms depending on the implementation, for example a write of a particular value by the system's CPU to a designated location in the watchdog's register space, or the execution of a special instruction by the CPU. In the absence of such a signal, the watchdog timer eventually times out and issues a reset to the system. The minimum frequency at which the watchdog has to be refreshed is determined by the timeout value of its timer. Figure 1 illustrates the basic concept of a watchdog.
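The behaviour described above can be sketched as a small software model in C. This is only an illustration under stated assumptions: the type `wdt_model_t` and the functions `wdt_init`, `wdt_refresh` and `wdt_tick` are hypothetical names invented here, time is counted in abstract watchdog clock ticks, and the "reset" is just a latched flag rather than a real hardware reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog timer.
 * It does not correspond to any real device's register map. */
typedef struct {
    uint32_t timeout_ticks;  /* ticks allowed between two refreshes */
    uint32_t counter;        /* ticks elapsed since the last refresh */
    bool     reset_asserted; /* latched once the timer has expired */
} wdt_model_t;

void wdt_init(wdt_model_t *w, uint32_t timeout_ticks)
{
    assert(timeout_ticks > 0); /* a zero timeout would fire immediately */
    w->timeout_ticks = timeout_ticks;
    w->counter = 0;
    w->reset_asserted = false;
}

/* The "All's well" signal: restart the count from zero. */
void wdt_refresh(wdt_model_t *w)
{
    w->counter = 0;
}

/* Advance the model by one watchdog clock tick.
 * Returns true once the timeout has been reached, i.e. the point at
 * which a real watchdog would issue a reset to the system. */
bool wdt_tick(wdt_model_t *w)
{
    if (!w->reset_asserted && ++w->counter >= w->timeout_ticks) {
        w->reset_asserted = true;
    }
    return w->reset_asserted;
}
```

With a timeout of 3 ticks, calling `wdt_tick` twice, then `wdt_refresh`, then `wdt_tick` twice more leaves the reset flag clear; a third consecutive tick without a refresh sets it.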

This arrangement works as follows. The firmware code is first profiled to determine the sequence of instruction execution and the time taken. Watchdog refresh routines are then inserted into the code in such a manner that the interval between the executions of two successive refresh routines is less than the watchdog timer's timeout period. If a code runaway happens, the program flow is disrupted, and the refresh routines are either not executed at all or executed at intervals exceeding the timeout period. The watchdog timer then times out and resets the system, pulling it back into a known state.
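The placement rule can be checked mechanically once profiling data is available. The sketch below is a simplified illustration, not the authors' tooling: it assumes the firmware has been split into segments with one refresh after each, that `durations[i]` holds the profiled worst-case time of segment `i` in watchdog ticks, and that the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first code segment whose profiled duration
 * reaches or exceeds the watchdog timeout, or n if every segment fits.
 * A refresh is assumed to be executed after each segment, so the gap
 * between two successive refreshes equals the segment duration. */
size_t first_overrun_segment(const uint32_t *durations, size_t n,
                             uint32_t timeout_ticks)
{
    assert(timeout_ticks > 0);
    for (size_t i = 0; i < n; ++i) {
        if (durations[i] >= timeout_ticks) {
            return i; /* the watchdog would fire during this segment */
        }
    }
    return n;
}

/* A refresh plan is safe when no gap reaches the timeout period. */
bool refresh_plan_is_safe(const uint32_t *durations, size_t n,
                          uint32_t timeout_ticks)
{
    return first_overrun_segment(durations, n, timeout_ticks) == n;
}
```

For example, segments of 40, 75 and 20 ticks are safe against a 100-tick timeout, whereas a 120-tick segment in the middle would be flagged at index 1. A runaway can be thought of as a segment whose duration grows without bound, which this check always reports as an overrun.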
One essential requirement of a watchdog is that it should be immune to the effects of runaway code. If runaway code were to accidentally disable the watchdog, there would be no way for the system to recover. Even an accidental modification of other watchdog parameters, such as its timeout period, is undesirable. Therefore a lot of thought has to go into the design of a watchdog and into its integration into the system.
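One widely used way to meet this requirement is a lock on the watchdog configuration that, once set, can be cleared only by a system reset. The model below illustrates that idea only; it is an assumption made here and not necessarily the scheme the authors have in mind, and `wdt_cfg_t`, `wdt_cfg_write` and `wdt_cfg_lock` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of lockable watchdog configuration. */
typedef struct {
    bool     enabled;        /* watchdog running? */
    uint32_t timeout_ticks;  /* configured timeout period */
    bool     locked;         /* once set, cleared only by a system reset */
} wdt_cfg_t;

/* Attempt to change the configuration.
 * Returns true if the write took effect, false if it was ignored
 * because the configuration has already been locked. */
bool wdt_cfg_write(wdt_cfg_t *c, bool enabled, uint32_t timeout_ticks)
{
    assert(c != NULL);
    if (c->locked) {
        return false; /* a stray write from runaway code has no effect */
    }
    c->enabled = enabled;
    c->timeout_ticks = timeout_ticks;
    return true;
}

/* Lock the configuration; the model offers no way to unlock it. */
void wdt_cfg_lock(wdt_cfg_t *c)
{
    assert(c != NULL);
    c->locked = true;
}
```

In this model, start-up code would call `wdt_cfg_write` once and then `wdt_cfg_lock`. Any later attempt to disable the watchdog or stretch its timeout returns false and leaves the settings untouched.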


pekon_
3/21/2011 2:57 PM EDT
Nice article, summarizing all the details to consider while designing a WDT. However, one thing to mention: RC oscillators are very inaccurate, and their frequency variation is large. So:
1) It would be difficult for the running software to stay in sync with the WDT refresh window.
2) The actual timeout period of the WDT can vary a lot depending on the RC frequency shift.
with regards, pekon