FPGAs are being used more and more in high-integrity and safety-critical domains. There is, however, a lack of consensus on how FPGAs can be safely deployed and certified. Should these devices be treated as hardware or software during the certification process? There is also little shared information on how to determine the risk associated with using FPGA technology.
FPGAs offer features such as parallelism, reconfiguration, separation of functions, and self-healing capabilities. All are compelling for creating redundant and independent blocks and for increasing overall availability. However, these features are not generally well known, especially to safety assessors.
This article outlines how the methods of the IEC 61508 Edition 2 Safety Standard apply to FPGAs, and it establishes the foundation of a guideline for a Safety Package that allows FPGA-based products to be certified in accordance with the functional safety recommendations of that standard.
FPGAs and safety systems of interest
FPGAs are an excellent fit in safety applications. In order to perform the necessary validations, however, the system utilizing FPGAs has to be defined.
Figure 1 defines the subsystem managing the equipment under control (EUC). Each subsystem is equipped with sensor elements (S) and actuator elements (A) connected to a logic system (LS). This connection is achieved through an input interface (I) and an output interface (O). This guideline concerns the input and output interfaces and the logic system. The EUC, the sensors, and the actuator elements are excluded, as are the communication protocols of the EUC.
According to the IEC 61508 Edition 2 Safety Standard, the different system levels are referred to as follows:
- The electrical/electronic/programmable electronic systems (E/E/PES) design level describes how different subsystems are correlated
- The subsystem design level describes how one of these subsystems is internally designed
- The FPGA design level describes how the FPGA is internally designed
Figure 1. FPGA in IEC 61508 system of interest
The objective is to determine and mitigate failures that have the potential to cause hazardous outcomes. These failures can be classified by origin as follows:
- Concept: incomplete safety concepts
- Design: incorrect design
- Implementation: systematic failures introduced in the generation of the bit-stream from the design description
- Build: physical faults in the FPGA when it is produced or when it is built into the system
- Wear: physical failures caused by continued use or storage
- External: random failures caused by the external environment in which the FPGA is operating
The basic assumption is that the overall design requirements for the FPGA are correct; issues with requirements and concepts are a more general problem, and the standard presumes that the safety management process should take care of them.
Failure mitigation
These failures are normally mitigated with redundancy and duplication, which increase:
- The availability of a system
- The robustness of the system
As safety-related functions might become very complex and expensive, it is desirable that only a specific part of a function is duplicated. Several types of redundancy can be defined as follows:
A) Hardware Redundancy
consists of duplicating all or parts of the electronic hardware. There are several possibilities:
- Single Chip redundancy
- Separate Chip redundancy
- Separate and diverse Chip redundancy
Additionally, the redundancy can be N-modular redundancy: simple redundancy, triple modular redundancy, etc.
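A triple-redundant arrangement, one case of the N-module options listed above, is usually paired with a majority voter. The following is a minimal behavioural sketch in Python rather than HDL; the function names and the channel labels A, B, and C are illustrative assumptions, not taken from the standard.

```python
from typing import List, Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant channel outputs."""
    return (a & b) | (a & c) | (b & c)


def vote_with_diagnostics(a: int, b: int, c: int) -> Tuple[int, List[str]]:
    """Return the voted value and the labels of channels that disagree with it.

    The channel labels "A", "B", "C" are hypothetical names for this sketch.
    """
    voted = majority_vote(a, b, c)
    disagreeing = [
        label for label, value in (("A", a), ("B", b), ("C", c)) if value != voted
    ]
    return voted, disagreeing
```

A single faulty channel is outvoted, and the list of disagreeing channels can be passed on to a diagnostic function so that the fault does not go unnoticed.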
B) Software Redundancy
consists of duplicating all or parts of the software, so that if one part of the software fails to operate, at least one of the duplicated parts will still deliver the correct service. Examples of software redundancy may be:
- Additional redundant conditions before initiation of a critical event in conjunction with special bit patterns (instead of single bits) for critical flags
- Implementation of different redundant data paths between critical inputs and their corresponding outputs
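The first example above, special bit patterns combined with redundant conditions, can be sketched as follows. The pattern values 0xA5 and 0x5A, the shadow copy of the flag, and the operator confirmation input are assumptions chosen for illustration only.

```python
# Hypothetical 8-bit patterns: they differ in every bit, so a single bit flip
# in either pattern yields a value that matches neither of them.
ARMED = 0xA5
DISARMED = 0x5A


def is_armed(flag: int) -> bool:
    """Decode a critical flag; any value other than the two patterns is a fault."""
    if flag == ARMED:
        return True
    if flag == DISARMED:
        return False
    raise ValueError(f"corrupted critical flag: {flag:#04x}")


def may_initiate(flag_primary: int, flag_shadow: int, operator_confirmed: bool) -> bool:
    """Allow the critical event only if all redundant conditions agree."""
    # Decode both copies first so that corruption in either one is always detected.
    primary = is_armed(flag_primary)
    shadow = is_armed(flag_shadow)
    return primary and shadow and operator_confirmed
```

With a single-bit flag, one bit flip silently changes the state. With the patterns used here, the same flip produces an invalid value that is rejected.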
C) Information Redundancy
consists of duplicating all or part of critical information so that if one part of the information becomes corrupted, at least one part containing the correct information is preserved. Examples of information redundancy may be:
- Multiple storage of any type of critical information
- Adding a checksum or signatures to preserve the integrity
- Special coding, such as anti-valent bits: 0 is represented by the two bits 01 and 1 by the two bits 10
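The anti-valent coding in the last item can be illustrated with a short sketch. The function names are hypothetical, and each logical bit is held here as a pair of integers for readability.

```python
from typing import List, Sequence, Tuple

BitPair = Tuple[int, int]


def encode_antivalent(bits: Sequence[int]) -> List[BitPair]:
    """Encode each logical bit as a pair: 0 -> (0, 1), 1 -> (1, 0)."""
    return [(1, 0) if bit else (0, 1) for bit in bits]


def decode_antivalent(pairs: Sequence[BitPair]) -> List[int]:
    """Decode anti-valent pairs; (0, 0) and (1, 1) are invalid codes."""
    decoded = []
    for index, (first, second) in enumerate(pairs):
        if first == second:
            raise ValueError(f"invalid anti-valent pair at position {index}")
        decoded.append(first)
    return decoded
```

A single bit flip within a pair always produces one of the two invalid codes, so it is detected rather than decoded as the opposite value.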
D) Time Redundancy
consists of repeating all or part of a critical control action at different times, to decrease the probability that faulty information is transferred and leads to system faults. Examples of time redundancy may be:
- Duplicating the transmission of a message at different times
- Duplicating the reading of a signal status within a time window
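Repeated reading of a signal can be sketched as below. The `read_signal` callable and the default of three samples are assumptions; the spacing of the samples within the time window is left to the caller or to the hardware.

```python
from typing import Callable, Optional


def read_confirmed(read_signal: Callable[[], int], samples: int = 3) -> Optional[int]:
    """Read a signal several times and accept it only if all readings agree.

    Returns the confirmed value, or None if the readings were inconsistent.
    """
    if samples < 2:
        raise ValueError("time redundancy needs at least two samples")
    readings = [read_signal() for _ in range(samples)]
    first = readings[0]
    return first if all(reading == first for reading in readings) else None
```

A transient glitch that affects only one sample makes the readings disagree, so the value is rejected instead of being propagated.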
Diversity is a method used to decrease the probability of systematic failures. Implementing a redundant function in different ways makes the redundant components independent of common development errors. Diversity can be realized in software or hardware. The two major forms of diversity realized with an FPGA at the design level are N Self-Checking Programming and N-Version Programming. A self-checking component adds redundancy so that it can check its own dynamic behavior during execution; it consists of either one variant and an acceptance test or two variants and a comparison algorithm. In an N-version system, each module is made up of N different implementations. Developing the same specification in both VHDL and Verilog is an example of N-Version Programming that minimizes the common cause effect of the HDL synthesizer.
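The "two variants and a comparison algorithm" form of self-checking can be sketched with two deliberately different implementations of one specification. Parity is an arbitrary example function chosen for this sketch, and all names are hypothetical.

```python
def parity_xor_fold(word: int) -> int:
    """Variant 1: fold the word bit by bit with XOR."""
    parity = 0
    while word:
        parity ^= word & 1
        word >>= 1
    return parity


def parity_popcount(word: int) -> int:
    """Variant 2: count the set bits and take the count modulo 2."""
    return bin(word).count("1") % 2


def self_checking_parity(word: int) -> int:
    """Run both variants and compare; a mismatch signals a fault."""
    if word < 0:
        raise ValueError("word must be non-negative")
    first = parity_xor_fold(word)
    second = parity_popcount(word)
    if first != second:
        raise RuntimeError("variant mismatch: possible systematic fault")
    return first
```

Because the two variants share a specification but not an algorithm, a design error in one of them is unlikely to be reproduced in the other, which is the property that diversity aims for.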
Diagnostics describes functions added to the system to detect, via on-line testing, whether operation is outside the specification. The results of the diagnostic tests become input parameters to subsystems that affect the system operation and should be considered in the safety system requirements standard.
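As a rough illustration of a diagnostic result becoming an input to the rest of the system, the sketch below checks monitored values against limits and derives a safe-state request. The monitor names, units, and limits are hypothetical and are not taken from the standard or from any device data sheet.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DiagnosticResult:
    name: str
    passed: bool


def check_range(name: str, value: int, low: int, high: int) -> DiagnosticResult:
    """On-line test: pass only if the monitored value is within [low, high]."""
    return DiagnosticResult(name, low <= value <= high)


def safe_state_required(results: Iterable[DiagnosticResult]) -> bool:
    """Request the safe state if any diagnostic test has failed."""
    return any(not result.passed for result in results)
```

For example, a core voltage reading in millivolts and a die temperature in degrees Celsius could each be checked with `check_range`, and the combined results passed to `safe_state_required`.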
Redundancy and diversity in conjunction with diagnostic functions ensure that critical failures are detected and handled in time to prevent the loss of safety-related functions.
IEC 61508 Edition 2 requirements for on-chip redundancy
By providing requirements for ASICs and FPGAs, Edition 2 of the standard fills a major gap left by Edition 1, which did not consider these components. Edition 2 provides techniques and measures for the avoidance and control of systematic failures as well as random hardware failures in ASICs and FPGAs. Another novelty of Edition 2 is the consideration of on-chip redundancy (i.e., the implementation of a minimum of two redundant channels on a single IC die). This is welcome, since on-chip redundancy has many advantages, such as reduced space and power consumption compared to classical redundant solutions. However, on-chip redundancy also brings more risks. A failure of a component with on-chip redundancy can produce a common cause failure (i.e., a failure that affects several channels of a redundant system). Therefore, the standard defines very stringent requirements to avoid or reduce the probability of such common cause failures.
One critical requirement of on-chip redundancy is the need for separation. Each channel, as well as each diagnostic function, must be electrically and thermally separated. This is achieved with a fence that enforces a minimum distance between blocks, so that each block is fully separated from the others. Temperature monitors in each block provide diagnostics to observe thermal influence. The effects of a failure in the power supply must also be taken into account, for example by implementing voltage monitors.
Interpretation of the IEC 61508 Edition 2 redundancy clauses
Failure mitigation should be implemented in line with the directives of the IEC 61508 Edition 2 Safety Standard as far as practicable, with special attention given to the correct interpretation of its clauses. Because Edition 2 extends coverage to ASICs and FPGAs, it contains a mix of rules and recommendations that are very conservative and that, if taken literally, may sometimes contradict each other. Assuming that, from a functional safety standpoint, a proper safety module has been identified and its redundancy concept analyzed and architected, a common question is how to achieve physical separation between two or more modules in a single FPGA for the purpose of augmenting diagnostics or redundancy (see Figure 2).
Figure 2. Example of Separation
Among the possible techniques, two are considered here:
- The Xilinx Isolation Design Flow (IDF)
- The detailed N-Version Isolation Design