A single event upset (SEU) occurs when a latch or logic element on the device is flipped into the wrong state by an unexpected event.
Two implementation issues remain. First, any additional capacitance must not significantly increase the die size. Large FPGAs, such as Stratix V from Altera, have configuration memory sizes of over 300 megabits, so an increased die area would push up the cost. As an example, Altera provides Stratix FPGAs with a built-in dedicated cyclic redundancy check (CRC) that detects configuration bit flips and flags an error on a dedicated pin. Second, the SRAM accessed by the FPGA application (e.g., Block RAM or BRAM) needs to be as fast as possible, because it affects the system speed that can be achieved.
All is not lost, however, because several additional factors come into play. For a start, only between 5% and 10% of the configuration file is actually used in any application. This is because there will be resources such as LUTs and interconnect that are not used in a particular design, so this significantly reduces the chance that a bit-flip will disrupt normal operation. Moreover, some vendor tools predict the likelihood of an SEU affecting the design. This information can be used to decide what, if any, additional measures should be taken.
Vendors have also provided other ways to mitigate the effects of SEUs. Errors induced in the BRAM can be identified by built-in error detectors. These are called Error Checking and Correction (ECC) circuits, which use additional parity bits to identify and flag any errors. As an example, Virtex-7 devices feature 8-bit parity that can automatically detect and correct any single-bit error or detect a double-bit error. The Stratix equivalent can detect up to three individual errors and correct up to two errors.
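To make the parity idea concrete, here is a minimal sketch of a textbook extended Hamming (SECDED) code in Python. It is an illustration only, not the circuit used in any vendor's BRAM: the function names `encode` and `decode`, the bit layout, and the status strings are my own assumptions. With a 64-bit word, this construction happens to need 8 check bits, giving a 72-bit codeword.

```python
def encode(data: int, k: int = 64) -> list[int]:
    """Encode k data bits into an extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit, power-of-two indices hold the
    Hamming parity bits, and every other index holds a data bit.
    """
    bits = [0]                      # index 0: overall parity, filled in last
    pos, i = 1, 0
    while i < k:
        if (pos & (pos - 1)) == 0:  # power of two: reserve a parity position
            bits.append(0)
        else:
            bits.append((data >> i) & 1)
            i += 1
        pos += 1
    # Choose parity bits so the XOR of the indices of all set bits is zero.
    syndrome = 0
    for idx, b in enumerate(bits):
        if b:
            syndrome ^= idx
    p = 1
    while p < len(bits):
        if syndrome & p:
            bits[p] = 1
        p <<= 1
    bits[0] = sum(bits) & 1         # make the total number of ones even
    return bits


def decode(codeword: list[int]) -> tuple[int, str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(codeword)
    syndrome = 0
    for idx, b in enumerate(bits):
        if b:
            syndrome ^= idx
    odd_parity = sum(bits) & 1
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1 and syndrome < len(bits):
        bits[syndrome] ^= 1         # single flip: the syndrome is its index
        status = "corrected"
    else:
        status = "uncorrectable"    # e.g. two flips: detected, not fixed
    data, i = 0, 0
    for pos in range(1, len(bits)):
        if (pos & (pos - 1)) != 0:  # skip parity positions
            data |= bits[pos] << i
            i += 1
    return data, status
```

Flipping any one bit of the codeword returned by `encode` should come back from `decode` as `"corrected"` with the original word, while flipping any two bits should be reported as `"uncorrectable"`, which is the single-correct, double-detect behaviour described above.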
There are features hard-wired into some FPGAs, such as the Virtex-7 family from Xilinx and Altera's Stratix, which provide a continuous background read-back from the configuration memory elements. This scans for differences from the initial data pattern and can detect both single and double errors. Should an error be detected, then it is flagged, and the logic can be set to perform an automatic correction on single bit errors. That's very neat.
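The background read-back idea can be modelled in a few lines of software. The sketch below is an assumption-laden toy, not how the hard-wired logic in either vendor's silicon works: it treats the configuration memory as a list of byte frames, uses Python's standard `zlib.crc32` as the check, and repairs a bad frame by copying it back from a stored golden image. The names `golden_crcs` and `scrub` are hypothetical.

```python
import zlib


def golden_crcs(frames):
    """Compute one reference CRC-32 per configuration frame."""
    return [zlib.crc32(frame) for frame in frames]


def scrub(live_frames, crcs, golden_frames):
    """Scan every live frame once and repair any whose CRC has changed.

    Returns the indices of the frames that were found corrupted.
    """
    repaired = []
    for i, frame in enumerate(live_frames):
        if zlib.crc32(frame) != crcs[i]:
            # Mismatch: restore the frame from the golden copy.
            live_frames[i] = bytearray(golden_frames[i])
            repaired.append(i)
    return repaired


# Usage: build a toy image, inject one bit flip, then run a scrub pass.
golden = [bytes([n] * 16) for n in range(4)]
live = [bytearray(frame) for frame in golden]
crcs = golden_crcs(golden)
live[2][5] ^= 0x10          # simulated SEU in frame 2
print(scrub(live, crcs, golden))
```

A real scrubber runs this loop continuously, so the time between an upset and its detection is bounded by how long one full pass takes.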
Even with all these measures, the more discerning will have noticed that the SEU is only detected at some time after the event. This means that any data processed between the time the SEU hits and the time it is detected might be in error. This may be acceptable, even in infrastructure equipment, because the corrupted packet will be detected and dropped.
However, in applications where this is not acceptable, other measures must be adopted. A commonly used method is called Triple Modular Redundancy (TMR). TMR is not something that you would adopt lightly. The reason is that it can significantly reduce the logic capacity, because the concept is to replicate the logic three times and then pass the three outputs into a majority voting circuit as illustrated below.
The Triple Modular Redundancy (TMR) concept.
There are several ways to incorporate TMR into a design. It does not have to be the entire circuitry that is triplicated, so it may be that only critical logic is included. However, as is shown in the above graphic, the majority voting logic now becomes a potential failure point. More complex schemes can overcome this shortcoming, but at this point I am starting to move outside my comfort zone.
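The voting step itself is simple enough to sketch in software. The Python below is a behavioural model under my own assumptions, not synthesizable logic: `majority_vote` is a bitwise two-out-of-three vote, and the hypothetical helper `tmr` runs three copies of a function and uses an `upset_mask` argument to mimic an SEU corrupting one copy.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(logic, x: int, upset_mask: int = 0):
    """Evaluate three replicas of `logic` and vote on the result.

    `upset_mask` flips bits in the second replica's output to simulate an SEU.
    Returns (voted_output, mismatch_flag).
    """
    r0 = logic(x)
    r1 = logic(x) ^ upset_mask    # the replica hit by the simulated upset
    r2 = logic(x)
    voted = majority_vote(r0, r1, r2)
    mismatch = not (r0 == r1 == r2)
    return voted, mismatch


# Usage: an arbitrary 8-bit function stands in for the protected logic.
def example_logic(x: int) -> int:
    return (x * 3 + 1) & 0xFF


print(tmr(example_logic, 5))                     # all replicas agree
print(tmr(example_logic, 5, upset_mask=0b100))   # one replica corrupted
```

In both calls the voted output is the same, and the mismatch flag is what would tell a scrubber that one replica needs repair. Note that `majority_vote` here is a single unprotected function, which is exactly the potential failure point mentioned above.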
Remember that I used the word "probability" when referring to the likelihood of an SEU occurrence in any particular SRAM element. One reason for the growing concern about SEUs is that, as FPGAs march down the Moore's Law path, the total quantity of SRAM is increasing. Given this, it would follow that the overall problem must be growing along with the FPGA complexity.
Fortunately, Xilinx publishes the results of radiation testing across its product range in its quarterly reliability report, expressed as Failures in Time (FITs) per Mbit of memory. This shows a significant drop in SEU susceptibility with the latest 28 nm products. Compared with the previous 45 nm generation, the probability of failure is halved for the configuration memory, while for the user memory (Block RAM) it falls to only one fifth. This shows that, with sufficient attention to the problem, SEU performance can be tackled with ongoing improvements... well, probably.
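For anyone who wants to turn a FIT figure into something more tangible, the arithmetic is short. One FIT is one failure per billion (10^9) device-hours, so a per-Mbit rate scales with the memory size and, if you wish, with the fraction of bits the design actually uses. The sketch below uses placeholder numbers that I have made up for illustration: the 100 FIT/Mbit rate is not a Xilinx figure, and the function names are hypothetical.

```python
HOURS_PER_FIT_UNIT = 1e9   # 1 FIT = 1 failure per 10^9 device-hours
HOURS_PER_YEAR = 8760.0    # 365 days, ignoring leap years


def device_fit(fit_per_mbit: float, mbits: float,
               critical_fraction: float = 1.0) -> float:
    """Scale a per-Mbit FIT rate to a whole device.

    `critical_fraction` derates for configuration bits the design never uses.
    """
    return fit_per_mbit * mbits * critical_fraction


def mean_years_between_upsets(fit: float) -> float:
    """Convert a FIT rate into a mean time between upsets, in years."""
    return HOURS_PER_FIT_UNIT / fit / HOURS_PER_YEAR


# Placeholder inputs: 100 FIT/Mbit (assumed), 300 Mbit of configuration
# memory, and 10% of the bits actually mattering to the design.
fit = device_fit(100.0, 300.0, critical_fraction=0.10)
print(fit, mean_years_between_upsets(fit))
```

With these made-up inputs, a single device would see a design-affecting upset about once every 38 years on average. The same rate spread across a thousand deployed units means one such upset somewhere in the fleet every couple of weeks, which is why infrastructure designers care.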
The future promises FPGAs fabricated using FinFET processes. This move from planar technology is expected to give a significant boost to performance and a reduction in power consumption as headline benefits. However, only time will tell how the radically different transistor will perform with respect to single-event upsets. How about you? Have you run into any SEU-based problems? Are you currently creating your designs to guard against such events?