Depends -- For example in DO-178/DO-254 used in the aviation industry, there are Safety Assurance Levels -- Level A = Loss of Aircraft or Life, Level E = No effect on continued safe flight of the aircraft -- For Key things like Flash and RAM integrity, There are OS'es that have been safety certified that can perform this all in software --- In aircraft systems whole redundant systems are used such that this allows one systems OS to drop offline and restart while the remaining Two or More Systems continue operation -- With that said this also increases cost sometimes. For automotive systems, the preceding systems relied on two means of controlling throttle -- the mechanical cable or link, and the EFI sensor -- both had to be in agreement to go full throttle -- It is clear the designers did not think the code through in terms of the OS implementations in this regard from the comments/article, based on the safety implications.
I agree. I wouldn't go off and focus on just this one aspect - whether RAM was parity checked or not. There are way more overarching design issues at play here, it looks like, and there are safe systems out there that don't necessarily have parity checked RAM.
Let's point out that parity checked RAM is just EDAC RAM that uses the simplest possible EDAC. As we've been discussing on other threads, in the end it's all about probability. There is a probability that one or more RAM bits could get randomly flipped, and the single bit flip event can be detected with a parity check, but many multi-bit flips will escape the parity check. EDAC can detect and/or correct many multi-bit flips, depending on the complexity of the error correcting code.
How much EDAC is enough? How low of a failure probability is low enough? As was already stated, "it depends."
We all agree that if a failure could cost a human life, then the probability of failure must be extremely low. But no matter how low we make it, failure-proof can never be completely guaranteed. And that is likely to be unsettling to many people.
For examples of systems being used in high error generation enviroments that do not use EDAC see www.spacemicro.com.
There are other methods besides these -- I have written a requirements guidelines document for this for another company for using automotive parts in an aerospace environment(Used to generate requirements for software) I have also done this for firmware.
It also would be interesting to see if the PCB had a latent defect such as a weak ground via, or other issue that could have contributed in the Toyota case -- Often unaware board layout people will reduce an OEM PCB's layer count to try and save cost -- this can lead to ground bounce due to I/O toggling and result in bit errors in the core CPU on rare occasions when a large number of pins toggle at once due to coincidence.
Good skill with PCB design can make or break a board in mass production -- have had to help troubleshoot many mass production board issues that contributed to hardware issues also.
See my profile for contact details if more information is needed
I've seen similar. Routing RTC (real time clock) tracks across address tracks on a PCB for a mobile phone caused the RTC to jitter when those address lines toggled on wake-up from deep sleep, thereby destroying the value of having an RTC in the first place and triggering a complete cell aquisition. Battery did not last long. Temporary solution was to rearrange program memory such that those address lines did not toggle on wake-up. Permamant solution was to respin the PCB.