I've worked around control software for nuclear devices, which obviously operate by a different set of rules than just about any other. One interesting safeguard is testing within the body of critical functions to ensure that the function was entered at the top, rather than as a random jump into the body of the code (potentially the kind of error that could result from cosmic rays). One of the guys on our team was former military, and he told us that they had running bets whether the missiles would actually fire, given a valid control sequence. None of them believed that it would fire by accident.
If you look at modern automotive control systems they are beginning to introduce redundant voting controls. This is an effective way of effectively eliminating this type of error, be it from hardware or software.
It's certainly the case that tasks can die, and require a system reboot. That's why you have watchdog timers in control system software. In the description of the problem, it appears that several tasks died simultaneoiusly, although we don't know which tasks nor how simultaneous they were.
And it's also not clear whether individual task were monitored correctly, and whether it was the simultaneous nature of the failures that created a case where the reboots didn't occur.
Also, it looks like they found several potential mechanisms, not necessarily THE cause. One way to design around this sort of problem, although nothing will be 100 percent, is to have redundant processes do the same computations, and then compare the control signal at the output. If there's no match, you default to no acceleration.
The last safety measure is of course the driver. If unintended acceleration occurs, certaily in a 2005 car, put the car in neutral and shut off the engine!
Although the quote about the danger of a "single bit flip" seems to have been in the context of software bugs -- it's hard to tell just from the quotes in this interview -- Barr also mentions single event upset. Memory bit errors (so-called "soft error rate") are a more of a hardware & system design issue, at least to the extent that the design includes mirroring, error detection and/or correction or other fail-safe measures.
At modern VLSI geometries, the soft error rate of an SRAM bit cell being bombarded with cosmic radiation at ground level is not as inconsequential as one might think -- especially for critical safety systems.
It makes one wonder how blame can be attributed to software in a system in which the source of the error may have been a random SRAM bit that was flipped by an alpha particle or other natural radiation event. Is the failure being blamed on software, or is it an overall laxity of hardware plus software that failed to prevent all of those 16 million possible ways a software task can die? How much fail-safing & hardware redundancy is enough to adequately protect against these events? In the end, it is a probabalitic issue, and the probability of failure will never be zero.
Blog Make a Frequency Plan Tom Burke 17 comments When designing a printed circuit board, you should develop a frequency plan, something that can be easily overlooked. A frequency plan should be one of your first steps ...