What I would hope would be to encourage the next generation of engineers to read not just about engineering successes, but engineering failures. It is not the making of mistakes that matters, but the failure to learn from them.
Although an electrical engineer, I remember being shown a film about the Tacoma narrows bridge disaster - Galloping Gertie and it really stuck in my mind. It really illustrates dramatically system instability and eventual collapse.
There is now a wealth of material available via U tube that could set the scene for classes in safety and reliability. Otherwise Failure modes and effects analysis or component reliability classes will seem incredibly boring.
System Safety in Computer-Controlled Automotive Systems, by Nancy G. Leveson, SAE Congress, March, 2000. (Postscript), (PDF).
An invited paper that summarizes the state of the art in software system safety and suggests some approaches possible for the automotive and other industries.
and what I think is a brilliant read:
High-Pressure Steam Engines and Computer Software by Nancy Leveson. Presented as a keynote address at the International Conference Software Engineering in Melbourne Australia, 1992 and published in IEEE Computer, October 1994. (PostScript) (PDF ).
A comparison between the history of steam engine technology and software technology and what we can learn from the mistakes made with steam engines.
Quite a lot, as it so happens! May I wish everyone a happy reading weekend!
No need to kill everything, just the engine....leave the brakes and all safety systems intact. First rule of problem resolution - when you find yourself in a hole you can't get out of - stop digging then solve the problem!
I think much of the discussion may be missing the point.
My previous experience with safety-critical systems was with burner safety controls, at Fireye, Inc. We designed and verified our products under UL 372. The centerpiece of regulatory approval was the Failure Mode Effects Analysis, in which we had to enumerate each of the physically possible failure modes of each of the components, and show by analysis or experiment that each of those failures would lead to a safe condition; the unit would either close the fuel valve or operate to spec. If it continued to operate, then we had to hold that fault and enumerate every other possible fault in combination with the non-shutdown fault. I've seen FMEA tables that ran to 1500 pages. We had to do it. No excuses.
Similarly, if the product had internal programming, we had to show that it was physically impossible for a software failure or memory corruption to cause an unsafe failure.
For safety-critical aircraft software there is RTCA/DO-178, which requires an excruciatingly rigorous development and validation process to eliminate all possibility of a dangerous bug. This is not pie in the sky, this is legally mandatory. And then, formally correct software can't be relied on if the hardware platform is buggy; for that there is RTCA/DO-254 to ensure that the hardware is logically correct.
Of course, any hardware platform can be correct at the time of manufacture and suffer a component failure in the field. So we assume that every physically possible failure will occur at some time in the future; it must be comprehensively proven that any failure of any component will result in a safe condition; ideally, the failure should be immediately obvious so that repairs will be performed before a second hardware failure appears.
In order that airtight proof of design correctness and comprehensive failure mode effects analysis be possible, the design must be kept simple, so that it can be completely understood by its human designers and independent reviewers. I can't emphasize that enough; it requires a certain ruthlessness to exclude anything from the product specification that will result in complexity. The safety-critical subsystems must be physically separated from the convenience functions, and from dangerous external influences such as high-level electromagnetic interference.
Personally, I find it unforgivable for a vehicle to be designed in such a way that it's physically possible to open the throttle without the driver's foot supplying the force to do it. If any part of the *mechanical* linkage between the gas pedal and the throttle fails, not even God should be capable of preventing the return spring from closing the throttle. Similarly, putting the transmission lever or its equivalent in an electric car into neutral should physically disconnect the wheels from the power source, and it should not be possible for any component failure to prevent it. There ought be a law. Literally.
The response of the non-embedded software community is also interesting: They claim the software development processes were bad , the tools we're bad and too low level for such a complex critical software , and the education of electronics engineers is also part here - because they are'nt taught proper software engineering is and it shows.
To be fair to NASA, from what I understand it, they really didn't have as much time as Michael Barr did in looking into this problem. Besides, Barr said that he was able to build his work by picking up where NASA left off. So, NASA's work is not wasted. But at the same time, NASA also made it clear in its own report: "Absence of proof that ETCS-i has caused an unintended acceleration doesn't vindiate te system."
There has been a lot of ink used discussing the Toyota electronic throttle issue - however, no post so far has commented on the failure of Toyota to provide a kill button that kills the power to the system of everything else fails. It would seem that looking for a tow truck is much preferable to a runaway out of control vehicle with a panicked driver and pax!
Hmmm..NASA did a long and expensive study on the reliability of the Toyota throttle control and didn't find anything wrong! Makes me wonder if NASA is no longer the bastion of high performance, high reliability HW/SW design. How is it that a consultant can find the problem but NASA cannot?