This is, frankly, nonsense. As was demonstrated in the Audi debacle years ago, an automobile with the accelerator floored (throttle fully open) will still come to a stop, in approximately 20% longer distance than normal, if the brakes are fully applied (ABS invoked). Audi's president at the time soberly demonstrated this by planting both feet while observers recorded the brake performance. This is because brake torque >> engine torque. It is a classic mechanical safety override (SW be damned). The only ways around this are either a) simultaneous failure of two major, separately enabled auto control systems (one electronic, the other largely hydraulic/mechanical), or b) a system which can electronically disable the brakes (pathological reverse-ABS??). The former is ridiculously unlikely, and the latter has not been demonstrated to be the case.
It is telling that, in the testimony, the vehicle was brought to a virtual stop on the dyno through brake actuation even with the simulated loss of task X.
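As a rough sanity check on the 20% figure above, here is a minimal sketch assuming constant deceleration and treating wide-open-throttle engine thrust as a constant force working against the brakes. The function names and the 30 m/s and 9 m/s^2 values are illustrative assumptions, not measurements from the Audi demonstration.

```python
def stopping_distance(speed_mps, decel_mps2):
    """Distance to stop from speed_mps at constant deceleration: d = v^2 / (2a)."""
    return speed_mps ** 2 / (2.0 * decel_mps2)


def implied_engine_fraction(distance_ratio):
    """Fraction of brake deceleration cancelled by the engine.

    Derived from d_wot / d_normal = a_brake / (a_brake - a_engine),
    so a_engine / a_brake = 1 - 1 / distance_ratio.
    """
    return 1.0 - 1.0 / distance_ratio


# Illustrative values only (assumptions, not test data).
speed = 30.0        # m/s, roughly highway speed
brake_decel = 9.0   # m/s^2, a hard stop on dry pavement

fraction = implied_engine_fraction(1.2)  # 20% longer stopping distance
normal = stopping_distance(speed, brake_decel)
wide_open = stopping_distance(speed, brake_decel * (1.0 - fraction))

print(f"engine cancels {fraction:.1%} of brake deceleration")
print(f"normal stop: {normal:.1f} m, throttle open: {wide_open:.1f} m")
```

Under these assumptions, a 20% longer stop means the engine cancels only about one sixth of the available braking deceleration, which is consistent with brake torque >> engine torque.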
Heh. That's why I think the numbers are only as good as the guy who worked them out. Yes, for sure, you have to consider the availabilities of the different subsystems in the calculation, and you also have to distinguish the functions that are safety critical from those that are not.
This is a whole science unto itself, as you might imagine. Books have been written on this subject.
Thank you Bert, makes sense... but how do you calculate mean time between failures on a complex software-hardware system? I think these calculations refer to component wearout and reliability, which is fairly standard in the component electronics industry... but they don't really take into account complex interactions between software and hardware, unexpected behaviour under signal interference, noise, etc. Kris
"is 99.99% certainity is sufficient, or you need 99.999% or better? how do you determine that point?"
Mean time between failures. That's the only way I know of to put those strings of nines to good use. Like Frank said in another post, in some cases, you can quickly reach the known age of the universe. At that point, surely, you've done a good job.
(Of course, these numbers are only as good as the guy who worked them out.)
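To make the "strings of nines" concrete, here is a minimal sketch of the textbook arithmetic, assuming constant failure rates and fully independent failures. Those assumptions are exactly what Kris questions above, so this says nothing about software/hardware interactions. All function names and MTBF values are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24


def series_mtbf(mtbf_hours):
    """MTBF of a chain where any single part failing fails the system.

    With constant failure rates, the rates (1 / MTBF) simply add.
    """
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from MTBF and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines(avail):
    """Number of nines, e.g. 0.9999 is about 4."""
    return -math.log10(1.0 - avail)


def downtime_minutes_per_year(avail):
    """Expected unavailable minutes in one year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60.0


def prob_all_channels_fail(mtbf_hours, mission_hours, channels):
    """Chance that every one of N independent redundant channels fails in a mission."""
    p_one = 1.0 - math.exp(-mission_hours / mtbf_hours)
    return p_one ** channels


# Hypothetical subsystem MTBFs in hours (assumed values, for illustration only).
subsystems = [50_000.0, 120_000.0, 200_000.0]
system = series_mtbf(subsystems)
print(f"series MTBF: {system:.0f} h")
print(f"availability with 4 h repair: {availability(system, 4.0):.6f}")
for a in (0.9999, 0.99999):
    print(f"{a}: {nines(a):.1f} nines, {downtime_minutes_per_year(a):.2f} min/year down")
for n in (1, 2, 3):
    print(f"{n} channel(s), 10 h mission: {prob_all_channels_fail(system, 10.0, n):.3e}")
```

Each added independent channel multiplies the failure probability by another small factor, which is how the equivalent MTBF quickly reaches absurd lengths. A common-cause fault (shared power, shared software bug) breaks the independence assumption and the numbers with it.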
Another very real issue that can haunt projects is the use of PCB vias that are too fine for a given environment. This can lead to via breakage due to shock, vibration, and temperature. (I personally went through a plant closing and a 1200 mile move due to vias breaking on another project's PCBs.) Even if only redundant ground vias break, the ground bounce can grow, and when combined with humidity the results can be even more significant. Ground bounce can cause logic corruption in MCUs, DSPs, CPUs and FPGAs. There has to be enough built-in self test of the hardware via software, and enough safeguards, to detect this issue.
Fascinating case and interesting lessons in product development and potential liability... it brings to my mind a question: how much design, verification and validation effort is required and sufficient? ...seems Toyota didn't do enough system testing... but when do you stop? Is 99.99% certainty sufficient, or do you need 99.999% or better? How do you determine that point? Kris
You mention "multithread" in the context of safety-critical code. That's kind of a stretch, given that there are only a small number of languages for which it is even possible to write an "informal" tool to determine whether a particular thread or build is threadsafe, let alone one that can demonstrate this in a "formal" manner (show as a matter of mathematical proof that it WILL NOT miss any thread problems) so that a safety agency could allow its use. And those languages generally either aren't suitable for safety-critical applications or are ones very few people write in in the first place. The truly safety-critical sections are required to run in a totally deterministic manner, so even object-oriented languages generally aren't currently tolerated for Level A of DO-178C (the known exception being Ada, and I haven't participated in one of those projects yet, so I'm not sure exactly what you are and aren't allowed to do). Some of the IEC safety coding standards are so stringent that even the "routine" use of interrupt service routines is prohibited; try doing precise timing or comms without that! So not only is there a heck of a lot of work that needs to be done on the fundamentals, there are also too many people without sufficient knowledge of how restrictive the current rules are or how VERY far we need to go before some of their "assumptions" come even CLOSE to reality. I believe it would be a "good first step" if the heads of the various groups who write these safety specifications could get together and publish some references on how all these languages, tools and requirements mesh; that would send a message to the academic world about which areas of research need to be highlighted.
Please note I don't want to "cast aspersions" on those who get it wrong or simply aren't aware of what they are saying; it's hard enough for those of us who spend a good portion of our lives trying to keep current at this. There are also quite a few "commercial claims" being made that need to be taken with a grain of salt: particular products or tools might theoretically have a certain advantage, but they still haven't been approved for use because their claims have yet to be proven.
Memory with ECC at the controller can mitigate the effects of electrically noisy environments.
ARM's AXI busses support client xPUs (APU/MPU/RPU) to provide task level access control to address space based on virtual machine IDs, even in multi-core SOCs.
Properly configured, even threaded task OSes without full MMU support can have some level of memory protection between threads, and in multi-core solutions, individual cores can be corralled into private sandboxes.
These two techniques have been around for years; they are not new.
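To illustrate the ECC point above, here is a toy sketch of single-error correction using a Hamming(7,4) code. It is only an assumption-laden illustration of the principle: real memory controllers use wider codes (typically with double-error detection as well), and the function names here are made up for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # 1-based positions carrying data; 1, 2, 4 carry parity


def hamming74_encode(nibble):
    """Encode a 4-bit value into a 7-bit codeword (list of bits, position 1 first)."""
    code = [0] * 8  # index 0 unused so indices match 1-based positions
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    # XOR together the positions of all set data bits.
    syndrome = 0
    for pos in DATA_POSITIONS:
        if code[pos]:
            syndrome ^= pos
    # Choose the parity bits so the XOR over all set positions becomes zero.
    for k in range(3):
        code[1 << k] = (syndrome >> k) & 1
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, error_position); error_position is 0 if no error was seen."""
    syndrome = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= pos
    corrected = list(bits)
    if syndrome:
        corrected[syndrome - 1] ^= 1  # the syndrome names the flipped position
    nibble = sum(corrected[pos - 1] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1  # simulate a noise-induced flip at position 5
value, where = hamming74_decode(word)
print(f"recovered {value:#06b}, corrected position {where}")
```

A single flipped bit anywhere in the codeword is located and repaired; two flips in the same word would be miscorrected by this toy code, which is why practical ECC adds an extra parity bit to detect that case.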
The FAA papers show some RTOSes that do some SW protection of tasks; for others it is done as part of the certification effort by the more reputable airframe and equipment manufacturers.
For example, Xilinx has some good whitepapers on SEU that detail some of the techniques for its ARM-based processors.
Spacemicro (www.spacemicro.com) offers IP/code for hardening un-hardened OSes and for using non-EDAC CPUs to self-check and to check against a redundant channel. These have been flown on space missions, where bit-flips can happen quite often even on a small MCU.
I have myself written guidelines for hardening software and firmware in MCUs and FPGAs for companies -- see my profile for contact information.
"Most of the major OS's such as VxWORKS, PSOS, Green Hills, etc should support something like this or better (possibly with an option) "
This is really the crux of what I was asking about. I'm trying to see if there are any RTOS vendors who advertise fault-tolerant countermeasures such as mirroring critical RTOS variables & data structures. I haven't found one yet. If I remember Michael Barr's testimony correctly, some of the scheduler's task lists or whatever were right next to the stack, and some of the important application variables weren't mirrored.
Will be interesting to see if this type of functionality starts showing up in some of the more heavyweight RTOSes. IMO it would be a reaction to this fiasco right here.
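For readers unfamiliar with the mirroring idea discussed above, here is a minimal sketch, assuming the common "store the value plus its bitwise complement" scheme. It is written in Python for brevity; in a real RTOS this would be C, with the two copies placed in separate memory regions. The class and exception names are hypothetical and do not come from any vendor's API.

```python
class CorruptionDetected(Exception):
    """Raised when a mirrored variable and its complement no longer agree."""


class MirroredU32:
    """A 32-bit value stored together with its bitwise complement."""

    MASK = 0xFFFFFFFF

    def __init__(self, value=0):
        self.write(value)

    def write(self, value):
        # Update both copies on every write.
        self._value = value & self.MASK
        self._mirror = ~value & self.MASK

    def read(self):
        # An intact pair XORs to all ones; anything else means a copy was corrupted.
        if (self._value ^ self._mirror) != self.MASK:
            raise CorruptionDetected("mirror mismatch, enter fail-safe state")
        return self._value


throttle_cmd = MirroredU32(42)
print(throttle_cmd.read())

throttle_cmd._value ^= 1 << 7  # simulate a bit flip or stray write to one copy
try:
    throttle_cmd.read()
except CorruptionDetected as err:
    print(f"detected: {err}")
```

This only detects corruption; it cannot tell which copy is right. Recovery (for example a third copy with majority voting, or a forced reset to a safe state) is a separate design decision.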