As discussed in Part 1 of this series (requirements and assessment flow), safety is one of the key parameters that most automotive companies are focusing on. Part 1 described the ISO 26262 standard and the related nomenclature. This part looks at design solutions for increasing the safety and reliability of products, enabling automotive chip suppliers and their customers to deliver safer parts.
Design failures
According to ISO 26262, a design should be robust enough to handle random failures caused by harsh ambient conditions. Cosmic rays and alpha particles have been observed to deposit enough charge inside a chip to change the state of one or more flip-flops or to cause a temporary change in the value of a net. In addition, due to aging, a flip-flop may no longer be able to retain its value for a long duration. These failures can be temporary, like a bit flip, or permanent, caused by wear-out of the device. Such failures can lead to a malfunction that results in a violation of safety goals (damage incurred).
Failures can be classified into two types, depending on the failure tolerant time.
a) Single point failures (SPF)
An SPF is a fault that immediately drives the output into an invalid state, i.e. it results in an immediate functional error (the failure tolerant time is short) and makes the error dangerous (violation of a safety goal). These errors should be detected as quickly as possible and corrective action taken. An example is incorrect functioning of the CPU core, which could lead to malfunction of a critical feature such as steering.
b) Latent failures
Latent failures are faults that do not immediately drive the output into an invalid state but degrade the part over time (the failure tolerant time is longer). On their own these faults might not cause an immediate functional failure, but combined with certain subsequent failure conditions they can be dangerous. These faults should be checked for periodically and corrective action taken. Examples are a bit flip in memory or degradation of the memory (i.e. of its ability to hold data).
Failure rate dependence
ISO 26262 defines a process intended to ensure higher tolerance to failure. The failure rate depends on various factors, such as the technology and the environment in which the part has to operate. These parameters are generally not under the control of silicon providers, so they need to look to design and architecture solutions to increase reliability (i.e. detecting failures as quickly as possible or reducing their impact).
Several design solutions that can be implemented to increase circuit reliability are discussed below.
Lockstep mode and delayed lockstep mode
An extra CPU is integrated in the device and executes the same operations as the main CPU. If one of the CPUs malfunctions, the mismatch is immediately sensed by compare logic. Once the fault is detected, the system could be designed to run in a safe mode and give a warning to the user.
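The compare logic can be illustrated with a small software model. The following is only a sketch under simplifying assumptions: each CPU is modeled as a plain Python function, and the names `cpu_step`, `with_bit_flip`, and `run_lockstep` are hypothetical, not taken from any real device or tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class LockstepResult:
    outputs: List[int]          # outputs accepted before any mismatch
    fault_cycle: Optional[int]  # cycle of the first mismatch, None if clean


def cpu_step(value: int) -> int:
    """Hypothetical stand-in for the work one CPU does per cycle."""
    return (3 * value + 1) & 0xFF


def with_bit_flip(cpu: Callable[[int], int], at_call: int, mask: int):
    """Wrap a CPU model so that its output is corrupted on one chosen call."""
    calls = 0

    def faulty(value: int) -> int:
        nonlocal calls
        out = cpu(value)
        if calls == at_call:
            out ^= mask  # model a transient bit flip on the output
        calls += 1
        return out

    return faulty


def run_lockstep(main_cpu, checker_cpu, inputs: Sequence[int]) -> LockstepResult:
    """Feed both CPUs the same inputs and compare their outputs every cycle."""
    outputs: List[int] = []
    for cycle, value in enumerate(inputs):
        main_out = main_cpu(value)
        check_out = checker_cpu(value)
        if main_out != check_out:
            # Mismatch: stop here so the system can enter its safe mode.
            return LockstepResult(outputs, cycle)
        outputs.append(main_out)
    return LockstepResult(outputs, None)
```

With two healthy models, `run_lockstep(cpu_step, cpu_step, [1, 2, 3])` reports no fault. Wrapping one of them with `with_bit_flip(cpu_step, 2, 0x01)` makes the comparison fail at cycle 2, which is the point where a real system would switch to safe mode.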
Another means of ensuring correct CPU operation is to run the main CPU one or two clock cycles delayed from the checker CPU and do the comparison accordingly. This is known as the delayed lockstep mode of operation.
Adding structural redundancy to make a design more immune to failures
Critical flops in the design could be replaced by the triple flop structure, as shown in the diagram above. The additional flops reduce the risk of a circuit malfunction due to bit inversion in one of the flops: a single flipped flop is outvoted by the other two, and the probability that two flops in the same structure flip simultaneously is much lower.
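A minimal sketch of how a triple flop structure masks a single upset is given below. It assumes a 2-of-3 majority voter on the flop outputs; the class and method names (`TripleFlop`, `upset`) are hypothetical and only serve the illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (b & c) | (a & c)


class TripleFlop:
    """Software model of one critical flop stored as three redundant copies."""

    def __init__(self, value: int = 0) -> None:
        self.copies = [value & 1] * 3

    def write(self, value: int) -> None:
        # All three copies capture the same input value.
        self.copies = [value & 1] * 3

    def read(self) -> int:
        # The voter output is what the rest of the design sees.
        return majority_vote(*self.copies)

    def upset(self, index: int) -> None:
        """Model a bit inversion in one of the three copies."""
        self.copies[index] ^= 1
```

After `write(1)` and a single `upset(0)`, `read()` still returns 1. A second upset in a different copy defeats the voter, which is why the structure relies on two simultaneous flips being unlikely.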
Critical modules/peripherals in the SoC can be replicated to provide fault tolerant operation (e.g. two ADCs measuring the same quantity). In case of an error in a module, the backup module can be made active. A safety protection mechanism must be put in place, such as regular LBIST (logic built-in self-test), ECC (error correction code), or CRC (cyclic redundancy check) scans of the configuration space, to detect errors in modules/peripherals.
Data redundancy added to make a design more immune to noise
A master unit could generate an error correction code that travels along with the address and data and is checked at the destination. With these end-to-end ECC checks, a bit flip caused by noise can be detected at the receiver end and corrective action taken. Similarly, critical modules can have local ECC checks.
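The end-to-end idea can be sketched as follows. Note the substitution: for brevity this example uses a CRC-8 (polynomial 0x07) as a detection-only check code instead of a true error correction code, and it assumes 32-bit address and data fields. The names `BusTransfer`, `master_send`, and `destination_accepts` are hypothetical.

```python
from dataclasses import dataclass, replace


def crc8(payload: bytes, poly: int = 0x07) -> int:
    """Plain CRC-8 (initial value 0, no final XOR), processed MSB first."""
    crc = 0
    for byte in payload:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


@dataclass(frozen=True)
class BusTransfer:
    address: int
    data: int
    check: int  # check code generated by the master


def check_code(address: int, data: int) -> int:
    # The code covers address and data together (assumed 32 bits each).
    return crc8(address.to_bytes(4, "big") + data.to_bytes(4, "big"))


def master_send(address: int, data: int) -> BusTransfer:
    """Master side: attach the check code to the outgoing transfer."""
    return BusTransfer(address, data, check_code(address, data))


def destination_accepts(transfer: BusTransfer) -> bool:
    """Destination side: recompute the code and compare with the received one."""
    return transfer.check == check_code(transfer.address, transfer.data)
```

A transfer built with `master_send(0x4000_0010, 0xDEADBEEF)` is accepted at the destination. If a single address or data bit is flipped on the way, for example via `replace(transfer, data=transfer.data ^ (1 << 5))`, the check fails and the destination can request corrective action.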
Similarly, data in RAM could be stored together with its ECC whenever a write occurs, and the ECC can be checked for errors on every read.
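As an illustration of the write and read paths, the sketch below protects each stored value with a Hamming(7,4) code, which corrects any single bit flip in a 7-bit word. This is an assumption made to keep the example small: the word width, the choice of code, and the names `EccRam` and `inject_bit_flip` are all hypothetical.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    code = [0] * 8  # positions 1..7 are used, index 0 is unused
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return sum(code[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(word: int):
    """Return (nibble, corrected) after fixing at most one flipped bit."""
    code = [0] + [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # a non-zero syndrome is the flipped position
    corrected = syndrome != 0
    if corrected:
        code[syndrome] ^= 1
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, corrected


class EccRam:
    """RAM model that stores the ECC on write and checks it on read."""

    def __init__(self) -> None:
        self.cells = {}

    def write(self, address: int, nibble: int) -> None:
        self.cells[address] = hamming74_encode(nibble & 0xF)

    def read(self, address: int):
        return hamming74_decode(self.cells[address])

    def inject_bit_flip(self, address: int, bit: int) -> None:
        """Model a single-bit upset in the stored codeword (bit 0..6)."""
        self.cells[address] ^= 1 << bit
```

Reading an untouched cell returns the stored value with `corrected` set to `False`; after one injected bit flip the read still returns the original value and flags the correction. A plain Hamming(7,4) code miscorrects double-bit errors, so this sketch does not cover that case.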