Our readers' take on Toyota's unintended acceleration
Many of our readers, without access to Barr's 800-page report, maintained a healthy dose of skepticism about what they read in the court transcript. Nonetheless, many concluded that the cascading effects of multiple software and hardware failures resulted in Toyota's unintended acceleration. As one reader put it, "software hygiene isn't a fool's errand."
Some also worried that the Toyota case may be just the tip of the iceberg for everyone engaged in developing driver assistance systems or drive-by-wire controls, as well as for those pushing fully autonomous cars.
It's certainly the case that tasks can die and require a system reboot; that's why you have watchdog timers in control system software. From the description of the problem, it appears that several tasks died simultaneously, although we don't know which tasks died or how simultaneous the failures were.
And it's also not clear whether individual tasks were monitored correctly, or whether it was the simultaneous nature of the failures that created a case where the reboots didn't occur.
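A minimal sketch of the task-monitoring pattern this comment describes, under the assumption of a simple RTOS where every task checks in each cycle; the names (task_checkin, watchdog_tick, hardware_watchdog_kick) and the check-in table are invented for illustration, not taken from any actual ECU code.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_TASKS 3
static volatile bool task_alive[NUM_TASKS];

extern void hardware_watchdog_kick(void);   /* assumed platform call */

/* Each monitored task calls this from its main loop. */
void task_checkin(int task_id)
{
    task_alive[task_id] = true;
}

/* Called from a periodic timer interrupt. The hardware watchdog is
 * serviced only if *every* task has checked in since the last tick,
 * so a single dead task forces a reset instead of letting the system
 * limp along with stale outputs. */
void watchdog_tick(void)
{
    bool all_alive = true;
    for (int i = 0; i < NUM_TASKS; i++) {
        if (!task_alive[i]) {
            all_alive = false;
        }
        task_alive[i] = false;   /* re-arm for the next interval */
    }
    if (all_alive) {
        hardware_watchdog_kick();
    }
    /* otherwise the hardware watchdog expires and resets the ECU */
}
```

The point of the pattern is that the watchdog only gets kicked when every monitored task has demonstrably run, which speaks directly to the concern above about individual tasks not being monitored correctly.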
Also, it looks like they found several potential mechanisms, not necessarily THE cause. One way to design around this sort of problem, although nothing will be 100 percent effective, is to have redundant processes do the same computations and then compare the control signals at the output. If there's no match, you default to no acceleration.
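A rough illustration of that compare-and-default scheme; the function names (channel_a_throttle, channel_b_throttle, select_throttle_command) and the tolerance are assumptions, and the two channels would normally run on independent processors, which this sketch only gestures at.

```c
#include <stdint.h>

extern int32_t channel_a_throttle(void);  /* command computed on processor A (assumed) */
extern int32_t channel_b_throttle(void);  /* same computation repeated on processor B (assumed) */

#define MATCH_TOLERANCE 2  /* allow a small numeric disagreement between channels */

/* Compare the two independently computed throttle commands; on any
 * mismatch, fall back to the safe state of zero acceleration. */
int32_t select_throttle_command(void)
{
    int32_t a = channel_a_throttle();
    int32_t b = channel_b_throttle();
    int32_t diff = a - b;

    if (diff < 0) {
        diff = -diff;
    }
    return (diff <= MATCH_TOLERANCE) ? a : 0;
}
```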
Looks like the failure was massive.
The driver claims that full-force braking had been applied. If that is indeed true (though there is no particular reason to take it for granted), then the failure must have crippled both the E-gas system (forcing the engine to overrev) and the ESP/ABS/brake assist, which could potentially have loosened the pressure on the brakes.
I have analyzed many systems where a minor, unnoticed software error could prove catastrophic. In one particular case, the probability of an incremental error happening was extremely small. However, the cumulative effect over a long period of time eventually proved catastrophic to the system's operation.
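A toy example of the kind of accumulation this reader describes, with entirely made-up numbers: an individually negligible truncation error in a periodic timekeeping routine grows into a visible drift over hours of operation.

```c
#include <stdio.h>

/* Hypothetical: a 10 ms tick is accumulated using a stored value that
 * is slightly short of the true period, losing a tiny amount per tick. */
int main(void)
{
    const double true_period   = 0.010;      /* seconds per tick */
    const double stored_period = 0.0099999;  /* what the code actually adds */
    double clock = 0.0;
    long   ticks = 8L * 60L * 60L * 100L;    /* 8 hours at 100 Hz */

    for (long i = 0; i < ticks; i++) {
        clock += stored_period;
    }
    /* Each tick loses only 0.1 microseconds, yet after 8 hours the
     * accumulated clock is off by roughly 0.29 seconds. */
    printf("drift after 8 h: %f s\n", ticks * true_period - clock);
    return 0;
}
```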
One assumes that the hardware and software gaps identified in this ECU implementation are sobering for anyone implementing driver assistance systems or drive-by-wire controls, as well as for those pushing fully autonomous cars.
Most automobiles today have multiple CPUs communicating over non-redundant networks with non-zero error rates; as the complexity of control increases in driver-assisted, drive-by-wire, and autonomous vehicles, the ability to fully test and verify the implementation diminishes. Even though the software and hardware may be designed to meet strict standards, the standards themselves have limitations and are open to some interpretation.
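Back-of-the-envelope arithmetic makes the "non-zero error rate" point concrete; every figure below is an assumption chosen purely for illustration, not a measured property of any real bus or fleet.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed figures, purely illustrative. */
    const double undetected_per_frame = 1e-13;  /* residual undetected-error probability */
    const double frames_per_second    = 1000.0; /* bus traffic per vehicle */
    const double seconds_per_year     = 3.15e7;
    const double vehicles             = 1e7;    /* fleet size */

    double per_vehicle_year = undetected_per_frame * frames_per_second * seconds_per_year;
    double fleet_per_year   = per_vehicle_year * vehicles;

    /* Even a vanishingly small per-frame probability adds up: about
     * 0.003 undetected errors per vehicle-year, but about 31,500 per
     * year across a ten-million-vehicle fleet. */
    printf("per vehicle-year: %g, fleet per year: %g\n", per_vehicle_year, fleet_per_year);
    return 0;
}
```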
My major concern is that automation systems built by multiple organizations have an almost zero chance of reaching the same decision at the same time in any given emergency event (and faults are just another emergency). Putting lots of these systems in close proximity in high-speed flowing traffic may produce a classic Butterfly Effect when things do foul up.
There is, in my opinion, no way of 100 percent preventing an uncommanded wide-open-throttle condition from occurring from time to time somewhere in a world population of over a billion vehicles. However, the effects of a wide-open throttle could be largely prevented by means of a totally independent fail-safe, a kill switch, for example, that reduces engine power in an emergency. The present situation, in which drivers have to brake against full engine power, is totally unnecessary and potentially very dangerous. From a functional safety point of view, it is unacceptable to make the driver the fail-safe for the malfunctioning electronics.
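One simple form of such an independent fail-safe is a brake-override monitor that runs outside the main engine-control software. The sketch below is hypothetical; the sensor and actuator calls (brake_pedal_pressed, throttle_position_percent, cut_engine_power) and the thresholds are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

extern bool    brake_pedal_pressed(void);        /* independent brake switch (assumed) */
extern uint8_t throttle_position_percent(void);  /* measured throttle, 0-100 (assumed) */
extern void    cut_engine_power(void);           /* e.g., cut fuel or spark (assumed) */

#define OVERRIDE_HOLD_TICKS 20   /* require the condition for ~200 ms of 10 ms ticks */

/* Runs on its own simple controller: if the driver is braking hard
 * while the throttle is wide open for a sustained period, reduce
 * engine power regardless of what the main ECU commands. */
void failsafe_monitor_tick(void)
{
    static uint16_t hold = 0;

    if (brake_pedal_pressed() && throttle_position_percent() > 90) {
        if (hold < OVERRIDE_HOLD_TICKS) {
            hold++;
        }
        if (hold >= OVERRIDE_HOLD_TICKS) {
            cut_engine_power();
        }
    } else {
        hold = 0;
    }
}
```

The design choice worth noting is independence: the monitor should not share the processor, the power supply, or the failure modes of the software it is guarding.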
…I developed a verification and testing process for a firm that built embedded engine controllers, and this all sounds familiar. I'd been dubious about the Toyota failures, but I didn't realize that this car was drive-by-wire. Buggy software as the root cause of the failure mode is therefore completely plausible, despite no finding of mechanical or electronic failures.
… This may be the first time that indicators of bad code (not actual results) were sufficient to get a judgment. If so, I hope this is a wakeup call for people who manage this kind of system development and its risks: software hygiene isn't a fool's errand.
…As for error replication, if you don't know the root cause, you are only guessing at it, which makes replication a real bear. I'm not surprised that they were not able to reproduce it. The theory that they examined was that it was a single-bit error, which can have many causes. And without ECC, it was unmitigated. The system design simply propagated the fault to a system failure, which was an unsafe end state.
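Where hardware ECC is absent, a common software mitigation is to keep safety-critical variables in redundant form and check them before use. The sketch below shows one version of that idea, storing a value alongside its bitwise complement; the type and function names are invented, and this is only one of several defensive patterns in use.

```c
#include <stdint.h>
#include <stdbool.h>

/* Critical variable stored with its bitwise complement, so a single
 * flipped bit in either copy is detected before the value is trusted. */
typedef struct {
    uint32_t value;
    uint32_t inverse;
} protected_u32;

void protected_write(protected_u32 *p, uint32_t v)
{
    p->value   = v;
    p->inverse = ~v;
}

/* Returns false if the copies no longer agree (possible bit flip);
 * the caller then falls back to a safe default instead of acting on
 * corrupted state. */
bool protected_read(const protected_u32 *p, uint32_t *out)
{
    if (p->value != (uint32_t)~p->inverse) {
        return false;
    }
    *out = p->value;
    return true;
}
```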
Of course, [Toyota] really have three errors. In addition to the unmitigated single-bit error, they also have a test and validation process that failed to find it. And, third, they have a design process that failed to prevent it from happening in the first place. The lawsuit will really only address the first error; it's incumbent on Toyota to address the other two. (And in my experience, Japanese companies tend to go after all three as a matter of course.)
This "flip-bit" situation reminds me of an AT&T problem several years ago. Their long-distance phone system went down entirely. The controlling software had been running without problem for many years. Upon examination, it was determined that one line of code that had never been executed in the previous years was finally executed because all the parameters leading to its execution were met for the first time. That one line of the source code was missing a semicolon at the end of the line of code! That's all it took to bring the entire system to its knees.