Of course, SIL4 and fault tolerance are not the same concept, but SIL4 requires a fault-tolerant architecture in almost all cases (as you say as well). The issue is that the standards often allow the safety risk to be minimized, because they define SIL as a function of probability of occurrence, severity and controllability. Very subjective. They also rely on a Hazard and Risk Analysis that most likely does not cover all the risks (there is not always a historical track record, especially in automotive). Not to speak of the unpredictable nature of the environment and the operator.
The other issue is that SIL thinking is rooted in the times when things were mostly linear (analog, continuous domain), where probabilities and graceful degradation (still) apply. Digital electronics and software, however, are in the digital (discrete, non-linear) domain. One small bit flip and 20 nanoseconds later, the system can have failed. The state space is so large that fault tree analysis techniques can never go to this level of detail. The point is also that software is like a virtual machine sitting on top of a discrete state machine sitting on top of a semiconductor device (that is again in the continuous domain). The hidden assumption for software is not so much that it is error-free (more or less true when using formal methods), but that the hardware is always fault-free. Hence we have a hierarchy of levels. At the chip level, reliability margins apply; at the discrete level, micro-redundancy applies; at the software level, block-level redundancy applies; and at the system level, macro-level redundancy applies. There is an additional level that takes into account residual common mode failures and that requires diversity as well. We have developed a criterion, called ARRL (Assured Reliability and Resilience Level), that takes this analysis into account. Draft white paper on request (I need an email to send it to).
The benefit of this approach is that it becomes possible to characterise components (or subsystem entities) in terms of how they deal with failures, and one can reuse them from one domain to another, also in the context of safety-critical systems (in essence the components carry a contract with them). One can also define rules on how to reach higher ARRL (and hence SIL) levels by composition. Note also that SIL and ARRL are complementary. They meet in the middle (just like a HARA and an FMEA do).
The point I wanted to make is that there is no reason why an MCU can't be made "fault tolerant" by default. Gates are almost free these days. And while lockstepping CPUs can help, they are not a miracle solution for safety. They basically only allow the system to detect that there is a fault, not to correct it. Safety comes from masking out such internal faults so that the system continues to deliver its service. Using 2 such chips (2oo4) is a higher-level remedy (but watch out for common mode failures, e.g. power issues). The other, in my view more interesting, approach is already in use in space (and as far as I know in high-end server chips like IBM's Power7): make the logic cells fault tolerant (triplicate the gates). A very nice and recent example is Microsemi's SmartFusion2. And it is not expensive. Certainly less expensive than developing a fault detection and correction architecture around traditional chips.
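To make the difference between detecting and masking concrete, here is a minimal software sketch. It is only an illustration: real lockstep comparison and gate triplication are done in hardware, and the function names `lockstep_compare` and `tmr_vote` are made up for this example.

```python
def lockstep_compare(a: int, b: int) -> bool:
    """Dual lockstep: report whether the two copies agree.

    A mismatch reveals that a fault occurred, but not which copy is wrong,
    so the only safe reaction is to stop or reset.
    """
    return a == b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Triplication: bitwise 2-out-of-3 majority vote.

    Each output bit takes the value held by at least two of the three copies,
    so a fault confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
faulty = good ^ 0b00001000  # single bit flip in one copy

# Lockstep only detects the disagreement.
assert lockstep_compare(good, good)
assert not lockstep_compare(good, faulty)

# The voter still delivers the correct value.
assert tmr_vote(good, faulty, good) == good
```

Note that the voter is only as good as the independence of the three copies: if the same bit flips in two of them (a common mode failure), the wrong value wins the vote.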
"The default approach should be fault tolerant (SIL4)"
Fault tolerant and SIL4 are not equivalent terms. Fault tolerant refers to the ability of a system or function to operate correctly even though one or more of its component parts are malfunctioning. SIL4 refers to the required or achieved probability or rate of failure of a safety system or function. Fault tolerant systems vary by how many simultaneous faults they can detect and by how many of those faults they can correct. It is only implicit that higher SIL levels generally require greater degrees of fault tolerance.
"In terms of functional safety this is not fault tolerant."
I am confused by this statement. Functional Safety refers to the part of the overall safety of a system or function that depends on a system or function operating correctly in response to its inputs. Thus, Functional Safety depends on the hazard and on what the correct function is. These microcontrollers >are< fault tolerant to the degree to which they are capable. ECC-SECDED means that the microcontroller can tolerate up to 2 simultaneous bit flip faults in any word at any time: 1 bit flip results in no effect, 2 bit flips result in a trigger that can be used to put the microcontroller into a safe state. That is fault tolerant, but whether that is fault tolerant enough depends on the particular Functional Safety requirements that are placed upon the microcontroller. Dual lock-step cores are fault tolerant: one fault is detectable. That is enough for some Functional Safety cases, but not for others. Triplication can provide 1 fault correction, but end-to-end triplication is exceptionally complex, and in distributed functions it exposes the system to Byzantine faults. The trend to accomplish guarantees of 1 fault correction is not triplication, but Quadruple Modular Redundancy (2oo4); that is, the pairing of lock-step microcontrollers, or the implementation of 2 pairs of lock-step cores in a microcontroller (see FSL's QUASAR project).
Freescale is committed to helping system manufacturers more easily achieve system compliance with functional safety standards (ISO 26262 and IEC 61508). Through our new SafeAssure functional safety program, engineers can easily identify Freescale hardware and software solutions that are optimally designed to support functional safety implementations. There’s more info about these as well as our safety processes and support at Freescale.com/safeassure
-Aaron McDonald, Freescale
Although this is a very relevant article, in terms of functional safety this is not fault tolerant. It allows the system to fail "safely" (like when driving 200 km/h).
The extra cost for triplication and voting is minimal (given today's silicon dimensions) and could seriously reduce the development cost of fault tolerance support. The default approach should be fault tolerant (SIL4), so that when there is a failure, the system drops to SIL3. It is still fully functional, and only a second failure leads to a fail-safe stop.
eric.verhulst (at=@) altreonic.com