Design Article
Need a watchdog for improved system fault tolerance?
Suhas Chakravarty, Rohit Tomar, Mohit Arora, Freescale Semiconductor
10/22/2008 11:41 PM EDT
To design a good watchdog, the following guidelines should be kept in mind:
- The width of the watchdog timer should be such that it can cover a whole range of timeouts for all available clock sources in the system.
- The watchdog timer should run off a clock source that is independent of the clock source of the system it is monitoring. Preferably it should be a dedicated clock source for the watchdog, say an RC oscillator. This means that even if the system clock fails for any reason, leaving the system hung, the watchdog timer can still time out and reset the system.
- The watchdog's method of signaling a fault to the system should be fault tolerant itself.
- The critical control and configuration register bits of the watchdog should have write protection on them so that once set they cannot be accidentally modified.
- The method of refreshing the watchdog should minimize the chance that runaway code refreshes it accidentally. If runaway code does manage to refresh the watchdog, the watchdog will either never detect the runaway or detect it only after a long delay.
- The response of the watchdog to detection of a runaway condition should be swift. If the watchdog takes too long to reset the system, a system in an unknown state could cause a lot of damage in a safety-critical application. Thinking back to the example of the robotic arm, the longer it takes for the arm to be halted in case of a fault, the greater the risk to the patient's life.
- The watchdog's proper operation should be testable, so that it can be verified after boot that it is up and functioning. The test should not take an impractical amount of time.
- The watchdog should facilitate diagnosis of the fault that caused a watchdog timeout.
Robust Watchdog
The Robust Watchdog has been designed keeping in mind the aforementioned guidelines. It incorporates features that make improvements over existing implementations, in the following specific areas:
- A better timed refresh scheme that is harder to replicate by accident.
- Timed password style access to control and configuration registers.
- Detection of runaway code footprints, before actual timeout.
- Faster yet fault-tolerant response to timeouts.
- Fast test of the watchdog.
The Width of the Watchdog Timer
When designing a watchdog, one of the questions confronting the designer is how wide the watchdog timer should be. The answer comes from deciding what range of timeout values to support and then considering the different clocks available to the watchdog.
Consider an example target timeout range of 1 ms to 1 second. To generate timeout values across this range, the width of the watchdog timer has to be chosen carefully. What makes this task difficult is that the frequency of the clock source for the watchdog could vary widely, from a few kHz (say an on-chip RTC oscillator) to hundreds of MHz (the system clock). Figure 2 shows the timeout values possible with 8, 16, 24 and 32 bit timers for different, practical clock frequencies.

The vertical green band marks out the range of timeouts that covers 1 ms to 1 second. As can be observed, a 32 bit counter is required to cover all clock frequencies and the expected range of watchdog timeouts.
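As a rough check on this sizing argument, the sketch below computes how many clock ticks a given timeout needs and the narrowest counter that can hold that count. The function names and the 200 MHz and 32.768 kHz clocks used in the note that follows are illustrative assumptions, not values read from Figure 2.

```c
#include <stdint.h>

/* Clock ticks needed to time `timeout_ms` milliseconds at `clk_hz`,
 * rounded up to a whole tick. */
uint64_t wdt_ticks_for_timeout(uint64_t clk_hz, uint64_t timeout_ms)
{
    return (clk_hz * timeout_ms + 999u) / 1000u;
}

/* Narrowest counter width n (in bits) whose maximum value, 2^n - 1,
 * is at least `ticks`. Returns 64 if no width below 64 bits is enough. */
unsigned wdt_min_counter_bits(uint64_t ticks)
{
    unsigned bits = 0;

    while (bits < 64u && (((uint64_t)1 << bits) - 1u) < ticks) {
        bits++;
    }
    return bits;
}
```

With these assumed clocks, a 1 second timeout at 200 MHz needs 200,000,000 ticks, which takes 28 bits, so of the 8, 16, 24 and 32 bit options only the 32 bit timer is wide enough. At the other extreme, a 1 ms timeout at 32.768 kHz needs only 33 ticks, which fits in 6 bits.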
Independent Clock Source
The Robust Watchdog implements the fairly standard option of switching between two clock inputs, one of which should ideally be connected to a dedicated clock source, such as an on-chip RC oscillator. The other clock source can be the system clock. In applications that aren't safety critical but still need the watchdog, the system designer might want to avoid the overhead of a dedicated clock source and simply use the system clock.
Write Protection
Watchdogs generally have several control and configuration register bits that influence their operation, for example a bit to enable or disable the watchdog. Since these bits have a direct impact on the watchdog's functioning, it is of prime interest to make sure they are not modified unintentionally. To achieve this, good watchdogs generally include a write protection scheme. One of the better existing schemes is password-style protection on the register bits, where the password is a sequence of two particular values. However, this scheme allows any amount of time to elapse between the writes of the two values, which means that the chances of runaway code accidentally replicating the password are high. If the writes of the two values are spaced far apart in the code, it could happen that after the write of the first value the code runs away in an unintended direction, causes havoc, and then, after enough iterations, branches to the location of the write of the second value.
The Robust Watchdog places a restriction on the time gap between the writes of the two values, thereby reducing the chances of runaway code being able to "unlock" the registers for writing and possibly disabling the watchdog. The limit on the time gap is set equal to the time it takes the CPU to fetch and execute the write instruction for the second value, so the user is forced to place the write instructions for the two values one after the other in the code (as assembly instructions). Now if there is a runaway after the execution of the first write, there is no time left for the code to return and execute the instruction writing the second value of the sequence. This makes the unlock sequence more unique, because it minimises the chance of the sequence being replicated by runaway code.
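The timed unlock can be modelled in software to see why a late second write fails. The following is a minimal behavioural sketch, not the Robust Watchdog's actual logic: the two key values, the window length in bus cycles, the choice to restart the sequence on any wrong value, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration; not taken from any real device. */
#define WDT_UNLOCK_KEY1   0x1234u
#define WDT_UNLOCK_KEY2   0x5678u
#define WDT_UNLOCK_WINDOW 2u  /* max idle bus cycles allowed between the two writes */

typedef struct {
    bool     key1_seen;          /* first key written, window still open */
    uint32_t cycles_since_key1;  /* idle cycles counted since the first key */
    bool     unlocked;           /* registers are open for writing */
} wdt_unlock_model;

void wdt_model_init(wdt_unlock_model *m)
{
    m->key1_seen = false;
    m->cycles_since_key1 = 0;
    m->unlocked = false;
}

/* Advance the model by one bus cycle in which the unlock register is not written. */
void wdt_model_tick(wdt_unlock_model *m)
{
    if (m->key1_seen) {
        m->cycles_since_key1++;
        if (m->cycles_since_key1 > WDT_UNLOCK_WINDOW) {
            m->key1_seen = false;  /* window expired: sequence must restart */
        }
    }
}

/* Model a write of `value` to the unlock register. */
void wdt_model_write(wdt_unlock_model *m, uint16_t value)
{
    if (m->key1_seen && value == WDT_UNLOCK_KEY2) {
        m->unlocked = true;        /* second key arrived inside the window */
        m->key1_seen = false;
    } else if (value == WDT_UNLOCK_KEY1) {
        m->key1_seen = true;       /* start (or restart) the sequence */
        m->cycles_since_key1 = 0;
    } else {
        m->key1_seen = false;      /* any other value aborts the sequence */
    }
}
```

In this model, two back-to-back writes of the keys unlock the registers, while the same two writes separated by more than `WDT_UNLOCK_WINDOW` idle cycles leave them locked, which is the behaviour that defeats code that runs away between the writes.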


pekon_
3/21/2011 2:57 PM EDT
Nice article, summarizing all the details of designing a WDT. However, one thing to mention: RC oscillators are very inaccurate, and their frequency variation is large. So
1) it would be difficult for running software to stay in sync with the WDT refresh window.
2) the actual timeout period of the WDT can vary a lot depending on the RC frequency shift.
with regards, pekon