
Design Article

Using FPGAs in mission-critical systems

Adam Peter Taylor, Principal Engineer, EADS Astrium

12/6/2010 5:01 AM EST

Correction schemes
The techniques presented so far do not detect or prevent an incorrect change from one legal state to another legal state. Depending upon the end application, such an undetected transition could result in anything from a momentary system error to the complete loss of the mission.

Techniques for detecting and fixing incorrect legal transitions are triple-modular redundancy (TMR) and Hamming encoding. The latter provides a Hamming distance of three and covers all possible adjacent states. A simpler technique for preventing incorrect legal transitions is the use of Hamming encoding with a Hamming distance of two (rather than three) between the states, and not covering the adjacent states. Even this simpler encoding, however, will increase the number of registers your design will require.

Triple-modular redundancy, for its part, involves implementing three instantiations of the state machine with majority voting upon the outputs and the next state. This is the simpler of the two approaches and many engineers have used it in a number of applications over the years. Typically, a TMR implementation will require spatial separation of the logic within the FPGA to ensure that an SEU does not corrupt more than one of the three instantiations. It is also important to remove any registers from the voter logic, since they can create a single-point failure in which an SEU could affect all three machines.
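The voting step can be illustrated with a small behavioral model. This is only a sketch in Python, not RTL: the helper names `majority` and `flip_bit` and the 4-bit example state are assumptions made for illustration, and the model says nothing about placement or spatial separation inside the FPGA.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote across three replica values."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model an SEU by inverting one bit of a register value."""
    return value ^ (1 << bit)


# Hypothetical 4-bit state held by all three replicas.
state = 0b0110

# One replica is upset: the other two outvote it.
single_upset = majority(state, flip_bit(state, 2), state)
print(single_upset == state)  # True

# The same bit upset in two replicas defeats the vote, which is why
# a single SEU must not be able to reach more than one replica.
double_upset = majority(flip_bit(state, 2), flip_bit(state, 2), state)
print(double_upset == state)  # False
```

The second case is the software analogue of the single-point failure described above: once one event can corrupt two of the three copies, majority voting no longer helps.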

The use of a state machine with encoding that provides a Hamming distance of three between states will ensure both SEU detection and correction. This guarantees that more than a single bit will change between states, meaning an SEU cannot make the state transition from one legal state to another erroneously. The use of a Hamming distance of two between states is similar to the implementation for the sequential machine, where the unused states are addressed by the “when others” clause or reset cycling. However, as the states are explicitly declared to be separate from each other by a Hamming distance of two within the RTL, the state machine cannot move from one legal state to another erroneously and will instead go to its idle state should the machine enter an illegal state. This provides a more robust implementation than the binary one mentioned above.
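As a rough check of the distance-two argument, the Python sketch below models a hypothetical four-state machine whose codes are the even-parity words of three bits. The state names, the encoding and the `recover` helper (standing in for the "when others" branch) are all assumptions for illustration.

```python
from itertools import combinations

# Hypothetical encoding: every pair of codes differs in exactly two bits.
STATES = {"IDLE": 0b000, "LOAD": 0b011, "RUN": 0b101, "DONE": 0b110}
WIDTH = 3


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def min_distance(codes) -> int:
    """Smallest pairwise Hamming distance within a set of codes."""
    return min(hamming_distance(a, b) for a, b in combinations(codes, 2))


def recover(register: int) -> str:
    """Model of the 'when others' branch: illegal codes fall back to IDLE."""
    legal = {code: name for name, code in STATES.items()}
    return legal.get(register, "IDLE")


print(min_distance(STATES.values()))  # 2

# A single-bit upset of any legal code is always an illegal code,
# so the machine returns to IDLE instead of silently changing state.
for name, code in STATES.items():
    for bit in range(WIDTH):
        upset = code ^ (1 << bit)
        assert upset not in STATES.values()
        assert recover(upset) == "IDLE"
```

Note that this encoding detects the upset but cannot tell which state was intended; that is the job of the distance-three scheme.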

If you wish to implement a state machine that will continue to function correctly should an SEU corrupt the current state register, you can do so by using a Hamming-code implementation with a Hamming distance of three and ensuring its adjacent states are also addressed within the state machine. Adjacent states are those which are one bit different from the state register and hence achievable should an SEU occur. This use of states adjacent to the valid state to correct for the error will result in N*(M+1) states, where N is the number of states and M is the number of bits within the state register. It’s possible to make a small high-reliability state machine using this technique, but crafting a large one can be so complicated as to be prohibitive. The extra logic footprint associated with this approach could also result in lower timing performance.
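The N*(M+1) figure can be checked with a short Python model. The four state names and the 5-bit encoding below are assumptions chosen so that every pair of codes is at least three bits apart; `build_decode_table` is a hypothetical helper that lists each legal code together with its adjacent states.

```python
# Hypothetical 5-bit encoding with a minimum Hamming distance of three.
STATES = {"IDLE": 0b00000, "LOAD": 0b00111, "RUN": 0b11001, "DONE": 0b11110}
WIDTH = 5


def build_decode_table(states: dict, width: int) -> dict:
    """Map every legal code and each of its single-bit neighbours to a state."""
    table = {}
    for name, code in states.items():
        words = [code] + [code ^ (1 << bit) for bit in range(width)]
        for word in words:
            if word in table:
                # Two states share a word, so the distance is below three.
                raise ValueError("encoding does not have Hamming distance 3")
            table[word] = name
    return table


DECODE = build_decode_table(STATES, WIDTH)

# N * (M + 1) = 4 * (5 + 1) entries must be handled by the state machine.
print(len(DECODE))  # 24

# Any single-bit upset of the RUN code still decodes as RUN.
print(all(DECODE[STATES["RUN"] ^ (1 << bit)] == "RUN" for bit in range(WIDTH)))  # True
```

With only four states the table already holds 24 entries, which gives a feel for why the approach becomes hard to manage for large machines. The remaining 8 of the 32 possible words correspond to multi-bit upsets and would still need a default branch.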

Deadlock and other issues
There are other issues to consider when designing a high-reliability state machine beyond the state encoding schemes. Deadlock can occur when the state machine enters a state from which it is never able to leave. An example would be one state machine awaiting an input from a second state machine that has entered an illegal state and hence been reset to idle without generating the needed signal. To avoid deadlock, it is therefore good practice to provide timeout counters on critical state machines. Wherever possible, these counters should not be included inside the state machine but placed within a separate process that outputs a pulse when it reaches its terminal count. Be sure to write these counters in such a way as to make them reliable.

When checking that a counter has reached its terminal count, it is preferable to use the greater-than-or-equal-to operator rather than the equal-to operator. If an SEU near the terminal count pushes the counter past the terminal value, an equality check would miss it and no output would be generated, whereas a greater-than-or-equal-to check still fires. You should declare integer counters with a range that is a power of two and, if you are using VHDL, increment them modulo that power of two so that in simulation they wrap around as they will in the final implementation [count <= (count + 1) mod 16; for an integer counter with a range of 0 to 15]. Unsigned counters do not require this, since they wrap around identically in RTL and post-route simulation.
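The effect of the comparison operator can be seen in a small Python model of a timeout counter. The terminal count of 10, the modulus of 16 and the helper `ticks_until_timeout` are assumptions for illustration; the SEU is modelled as a single bit flip applied to the count on one chosen tick.

```python
import operator

TERMINAL = 10   # hypothetical terminal count
MODULUS = 16    # counter wraps at a power of two, as in (count + 1) mod 16


def ticks_until_timeout(compare, upset_tick=None, upset_bit=0, limit=64):
    """Return the tick on which the timeout pulse fires, or None if it never does."""
    count = 0
    for tick in range(limit):
        if tick == upset_tick:
            count ^= 1 << upset_bit          # SEU flips one counter bit
        if compare(count, TERMINAL):
            return tick
        count = (count + 1) % MODULUS        # explicit power-of-two wraparound
    return None


# Without an upset both comparisons fire on the same tick.
print(ticks_until_timeout(operator.eq))  # 10
print(ticks_until_timeout(operator.ge))  # 10

# An upset on tick 9 turns the count 9 (1001) into 13 (1101).
print(ticks_until_timeout(operator.ge, upset_tick=9, upset_bit=2))  # 9
print(ticks_until_timeout(operator.eq, upset_tick=9, upset_bit=2))  # 22
```

In this model the equality check only recovers because the counter wraps around; the pulse arrives 12 ticks late. The greater-than-or-equal-to check fires at once, one tick early, which is usually the safer failure mode for a timeout.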

You can replicate data paths within the design and compare outputs on a cycle-by-cycle basis to detect whether one of them has been subjected to an SEU event. Wherever possible, edge-detect signals to enable the design to cope with inputs that are stuck high or stuck low. You should analyze each module within the design at design time to determine how it will react to stuck-high or stuck-low inputs. This will ensure it is possible to detect these errors and that they cannot have an adverse effect upon the function of the module.
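Both ideas in this paragraph can be sketched as a behavioral model in Python. The helper names `mismatch_cycles` and `rising_edges`, the sample values and the assumption that the edge detector's history register resets to 0 are all illustrative.

```python
def mismatch_cycles(path_a, path_b):
    """Cycle numbers on which two replicated data paths disagree."""
    return [cycle for cycle, (a, b) in enumerate(zip(path_a, path_b)) if a != b]


def rising_edges(samples, previous=0):
    """One-cycle pulse wherever the input goes from 0 to 1.

    `previous` models the history register, assumed to reset to 0.
    """
    pulses = []
    for sample in samples:
        pulses.append(1 if sample and not previous else 0)
        previous = sample
    return pulses


# Replicated data paths: an upset in the second copy shows up on cycle 2.
print(mismatch_cycles([3, 5, 7, 9], [3, 5, 6, 9]))  # [2]

# A healthy input produces one pulse per rising edge.
print(rising_edges([0, 1, 1, 0, 1]))  # [0, 1, 0, 0, 1]

# A stuck-high input yields at most one pulse, a stuck-low input none.
print(rising_edges([1, 1, 1, 1]))  # [1, 0, 0, 0]
print(rising_edges([0, 0, 0, 0]))  # [0, 0, 0, 0]
```

A level-sensitive consumer would see the stuck-high input as a request on every cycle, whereas the edge-detected version reacts once at most, which makes the fault easier to bound and to detect.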




Mario Blunk

12/8/2010 3:03 AM EST

Interesting work but the flow diagrams are hard to read. Please make them readable for the public. Thank you.




Bob Lacovara

12/8/2010 1:35 PM EST

This is more than just good advice for mission critical implementations. Many of the suggestions would apply equally well to software. Of course, viewed from the hardware description language perspective, the hardware is software...not that I wish to start a philosophic discussion on that score. But state machines and software systems both feature the ability to find themselves at execution points or states unexpectedly, and can sport unreachable states, and so on. I was bemused to find that one-hot machines were in use: I wasn't aware that much use had been made of one-hot architectures. Interested readers might want to go to abebooks and find a copy of Hill and Peterson's "Digital Logic and State Machine Design" (even the 2nd edition will do) wherein they will find a pedagogical HDL that compiles into a one-hot model. Another interesting thing about one-hot designs is that parallel execution is possible under some circumstances by allowing more than one flop to be hot at once. Suitable precautions and procedures are required, though.




karax

12/9/2010 5:18 AM EST

Speaking of simulation and verification tools: the SST tool from Nebrija uses built-in simulator commands, that is, it feeds the simulator the appropriate commands (Perl/Tcl) to control the simulation or inject faults. This method is easy to implement and does not require modifying the circuit models. On the other hand, command parsing and the simulator's reaction time reduce performance, which makes the approach unviable for big circuits. I know of other, more advanced techniques that speed up the simulation process using FLI (ModelSim) and hardware emulation on FPGAs.




anne-francoise.pele

7/16/2012 11:25 AM EDT

Dear Mario,

I have made the diagrams more readable. Click on the images to enlarge them.

Best,

Anne-Francoise Pele



