News & Analysis
Neutron storm swirls around FPGA reliability
Ron Wilson
4/19/2004 9:00 AM EDT
Actel Corp. (Mountain View, Calif.) on Monday (April 19) will release a report indicating that normal neutron flux even at the Earth's surface can alter the configuration of SRAM-based FPGAs. Actel will use the data to argue that its antifuse and flash-based field-programmable gate arrays are a better choice than SRAM-based parts in failure-sensitive applications. But SRAM FPGA vendors like Xilinx and Altera counter that the new data is consistent with a large body of existing measurements that in fact show how resistant SRAM-based FPGAs are to neutron radiation.
The Actel report, based on tests conducted at Los Alamos National Laboratory's neutron-beam facility, is bound to add energy to a simmering debate about the reliability of SRAM-configured FPGAs. "SRAM-based devices did show significantly more upsets, and more changes to their operation, than did the flash-based parts," said Eric Dupont, president and CEO of iROC Technologies (Santa Clara, Calif.). Actel contracted with iROC, a specialist in single-event-upset modeling and correction, on the tests. The report is available at www.actel.com.
No party to the controversy disputes that ionizing radiation, whether from alpha particles or cosmic neutrons, can change the state of SRAM cells. At issue is the impact on actual systems.
Test data suggests that the field failure rate for a single FPGA from neutron-induced configuration-bit upset is vanishingly small: a mean time to chip failure on the order of a millennium or perhaps a century, depending on the configuration being used. But in large networks of systems, each equipped with a moderate number of FPGAs, these numbers can add up to a serious problem. At the same time, high-altitude or space deployment can change everything by orders of magnitude.
The mechanism of the upsets is well understood. Moderately energetic neutrons penetrate the Earth's atmosphere continuously, creating a permanent particle flux that increases with both altitude and latitude. That flux can change the SRAM bits that configure the logic cells and interconnect matrices of FPGAs, potentially causing errors in the behavior of the devices.
Neutrons are quite standoffish particles that rarely interact with other matter. Hence, shielding against them is nearly hopeless. However, once in a long while a neutron, on its trajectory through an IC, will strike a silicon atom. One of the possible outcomes, depending on the energy involved, is that the silicon atom will become ionized and displaced from its lattice. This ionized heavy atom will blaze through the lattice, in the process generating a trail of energetic carriers that can be swept into a junction, causing a current pulse. Under realistic conditions, this pulse can in fact change the state of an SRAM cell.
After some acrimonious public debate on the significance of such events, Actel, which does not use SRAM cells for configuration purposes in two of its FPGA families, contracted with iROC to test real devices.
Given the very long times between events at sea level, the team decided to use accelerated testing at the Los Alamos Neutron Science Center (LANSCE). The devices were bombarded with very high doses of neutrons, with a carefully controlled energy spectrum, to simulate decades and centuries of exposure at various altitudes.
"We wanted to separate out the effect of changes in configuration from upset of individual flip-flops or memory bits," said Ken O'Neill, director of product marketing at Actel. "So we filled each of the devices under test with a step-and-repeat pattern of cascaded combinatorial multipliers. We used over 90 percent of the devices' capacities in this way.
"During the exposure, we would both read out the configuration memory of the device, to determine if any of the configuration bits had been altered, and we would perform a behavioral test on the device to determine if the chip was still acting correctly. In this way we could see not only if a bit had changed, but if the change had made any difference in the operation of the chip."
The tests were performed on 0.22-micron flash and 0.15-micron antifuse devices from Actel, and on 0.13-micron Altera and 90-nanometer Xilinx Spartan SRAM FPGAs. Dupont said that iROC adopted the procedure defined in the JESD-89 standard for the tests. The standard describes both a procedure for accelerated life testing for neutron radiation and one for estimating actual failure-in-time (FIT) rates at normal atmospheric radiation levels from the measured results.
Not too surprising
"The results were not particularly surprising," Dupont said. "The behavior of the SRAM-based FPGAs was essentially consistent with the behavior of other SRAM devices we have tested."
Specifically, over the course of the radiation dose, the nominally 1 million-gate Altera EP1C20 exhibited 453 functional failures. The nominally 1 million-gate Xilinx XC3S1000 exhibited 1,936 configuration-cell upsets and 405 functional errors, and the nominally 3 million-gate Xilinx XC2V3000 showed 3,459 configuration upsets and 349 functional errors. Neither the antifuse Actel AX1000 nor the flash APA1000 displayed any functional errors.
Configuration data was not checked on the Actel parts because it is not readily available. Nor was it checked on the Altera part, because the experimenters were unable to get the configuration read-back function to work until after the experiment.
Using the formulas in JESD-89, the data predicted that at sea level the Altera device would experience 460 FIT and the two Xilinx devices 320 and 1,150 FIT, respectively. A failure in time is one failure in a billion hours, or about 114,000 years.
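The conversion from a FIT rate to a mean time to failure is simple arithmetic, and a short sketch makes the scale of these numbers easier to see. The function name below is illustrative only, the year is taken as 8,760 hours, and the failure rate is assumed constant over time; none of this comes from the JESD-89 formulas themselves.

```python
HOURS_PER_YEAR = 8760   # 365 days x 24 hours; leap years ignored
FIT_HOURS = 1e9         # one FIT = one failure per billion device-hours


def fit_to_mttf_years(fit):
    """Mean time to failure in years for a single device with the given FIT rate.

    Assumes a constant failure rate, so MTTF is the reciprocal of the rate.
    """
    return FIT_HOURS / fit / HOURS_PER_YEAR


# Sea-level projections quoted in the article (FIT per device).
projected_fit = {
    "Altera EP1C20": 460,
    "Xilinx XC3S1000": 320,
    "Xilinx XC2V3000": 1150,
}

for part, fit in projected_fit.items():
    print(f"{part}: {fit} FIT -> about {fit_to_mttf_years(fit):.0f} years per device")
```

Under these assumptions the three parts come out between roughly one and three and a half centuries per device, in line with the "millennium or perhaps a century" range cited for single chips.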
Clearly this is not a failure rate that would keep the average iPod owner up nights. But Actel vice president of marketing Barry Marsh pointed out that at the altitude of Denver, the FIT rates would be more than three times greater than those at sea level. And when systems equipped with large numbers of FPGAs are considered, rather than individual parts, the numbers can become worrisome. For example, Actel estimated that a Sonet ring with 64 systems, each using 64 FPGAs and operating at an altitude of 5,000 feet, could experience a mean time to failure of less than 250 hours just from configuration-induced failures.
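A rough check of the Sonet-ring figure can be sketched as follows. The calculation assumes that device failures are independent with constant rates, so that FIT values simply add across devices, and it uses an altitude multiplier of exactly 3.0 as a stand-in for "more than three times"; Actel's actual assumptions are not given in the article, and the function name is hypothetical.

```python
FIT_HOURS = 1e9  # one FIT = one failure per billion device-hours


def system_mttf_hours(fit_per_device, device_count, altitude_factor=1.0):
    """Mean time to first failure, in hours, across a population of devices.

    Assumes independent devices with constant failure rates, so the
    system failure rate is the sum of the per-device rates.
    """
    total_fit = fit_per_device * device_count * altitude_factor
    return FIT_HOURS / total_fit


device_count = 64 * 64   # 64 systems, each with 64 FPGAs
altitude_factor = 3.0    # assumed multiplier for roughly 5,000 feet

for fit in (320, 460, 1150):   # sea-level FIT projections from the article
    hours = system_mttf_hours(fit, device_count, altitude_factor)
    print(f"{fit} FIT per device -> about {hours:.0f} hours between failures")
```

With the lowest of the three projected rates, 320 FIT, this sketch gives a little over 250 hours; the higher rates push the figure well below that.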
"Acceptable FIT rates for commercial ICs are normally under 100," Marsh said. "The fact is that even at sea level SRAM-based FPGAs can exceed that by a factor of 10. That has to have an impact on the reliability of large systems, or of large populations of small systems."
Vendors of the SRAM-based devices disagreed. "It is very clear that flash- and antifuse-based FPGAs are more resistant to radiation-induced upset of their configurations," said Tim Colleran, vice president of product marketing at Altera Corp. (San Jose, Calif.). "Designers whose systems must work in high-radiation environments are aware of this, and that has given Actel an important niche in the market. But to claim that there is a significant issue outside the space market, I think, is scare-tactic marketing. If this were a real issue in real systems at sea level, it would be showing up in system failures, and be getting traced back to us. But we are frankly not seeing any system reliability impact from the issue."
Austin Lesea, principal engineer in the advanced products group at Xilinx Inc. (San Jose), added some quantitative information to the debate. "We have been testing devices for neutron susceptibility for a long time, both at LANSCE and with atmospheric testing using ambient radiation," Lesea said. "One of the more important contributions from this work is what we call the Rosetta Project: It is an experimentally based way of using the high-dose LANSCE data to project failure numbers that are consistent with what we are actually observing in our ambient test sites."
Lesea said Xilinx now has almost 1,000 devices in a number of test centers, including an observatory on Mauna Kea in Hawaii and sites in Albuquerque, N.M., and San Jose. The company has accumulated 2,500 device-years of data from these installations. In addition, several Xilinx customers have gathered data of their own that has proved relatively consistent with this database, he said.
Lesea claimed the Xilinx data suggests a mean time to failure (meaning a configuration-bit upset, not a device failure) for a typical device on the order of 20 years at sea level. This number could get up to 25 times worse depending on the combination of altitude and latitude, he said. This would suggest a bit-upset FIT figure of around a few thousand, relatively consistent with the projections from the iROC data.
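The step from a 20-year mean time between upsets to a FIT figure in the thousands can be reproduced with a short sketch. It assumes a constant upset rate and an 8,760-hour year, and it applies the 25-times worst case as a plain multiplier on the rate; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760
FIT_HOURS = 1e9  # one FIT = one event per billion device-hours


def mtbu_years_to_fit(years):
    """Convert a mean time between upsets, in years, to a FIT rate."""
    return FIT_HOURS / (years * HOURS_PER_YEAR)


sea_level_mtbu_years = 20
worst_case_multiplier = 25   # altitude and latitude combined, per Lesea

sea_level_fit = mtbu_years_to_fit(sea_level_mtbu_years)
worst_case_fit = sea_level_fit * worst_case_multiplier
worst_case_mtbu_years = sea_level_mtbu_years / worst_case_multiplier

print(f"Sea level: about {sea_level_fit:.0f} bit-upset FIT")
print(f"Worst case: about {worst_case_fit:.0f} FIT, "
      f"or one upset every {worst_case_mtbu_years:.1f} years")
```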
Gauging the meaning
But how changes in configuration bits relate to device failures is a complex and hotly disputed topic. The Actel experiment showed that, on average, there would be an observable device failure for every five to 10 configuration-bit changes.
"This relationship is highly dependent on a number of variables," Lesea said, including "the configuration itself" and "whether the designer is taking any measures to minimize the impact of configuration-bit changes." Xilinx estimates "that the ratio of bit upsets to actual chip functional failures can range from about six to about 100," he said. "We have one very complete data set from a customer on the 2V4000 part that shows a mean time to bit upset of 26 years, and an average of one chip failure for every 42 bit upsets."
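To see what these ratios imply for functional failures, the upset interval can be multiplied by the number of upsets per failure. The sketch below assumes each upset has the same, independent chance of causing a functional failure and that upsets do not accumulate; those are simplifying assumptions rather than anything stated by Xilinx, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760
FIT_HOURS = 1e9  # one FIT = one failure per billion device-hours


def functional_mttf_years(mtbu_years, upsets_per_failure):
    """Mean time to functional failure, given the mean time between bit
    upsets and the average number of upsets per functional failure."""
    return mtbu_years * upsets_per_failure


def years_to_fit(years):
    """Convert a mean time to failure in years to a FIT rate."""
    return FIT_HOURS / (years * HOURS_PER_YEAR)


# Customer data set on the 2V4000: 26 years per upset, 42 upsets per failure.
customer_years = functional_mttf_years(26, 42)
print(f"2V4000 data: about {customer_years} years per functional failure, "
      f"or roughly {years_to_fit(customer_years):.0f} FIT")

# Xilinx's estimated range of 6 to 100 upsets per failure, applied to the
# 20-year sea-level upset interval quoted earlier.
for ratio in (6, 100):
    years = functional_mttf_years(20, ratio)
    print(f"{ratio} upsets per failure -> about {years} years, "
          f"or roughly {years_to_fit(years):.0f} FIT")
```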
Despite their susceptibility to neutron radiation, SRAM-based FPGAs are used in high-radiation environments. For example, a Xilinx Virtex 1000 resides in the control circuit for each of the Mars rover's wheels. In these applications, both Xilinx and Altera recommend design tactics to overcome the risk of failure.
"The main strategies are triple redundancy, error-correcting or parity codes, and redundancy in time," Lesea said. "In fact, Xilinx has a tool that will convert a design into a voting, triple-redundant design automatically."
An alternative to full redundancy, Lesea said, is redundancy in time. An FPGA can complete a task, reconfigure itself, repeat the task and compare the results, if time allows such a strategy. Or circuitry can check its results continuously with error-detecting or error-correcting codes. In less sensitive applications, simply reconfiguring the FPGA periodically can substantially reduce the probability of a failure. "The chance of upsets actually changing the function of the chip increases with the number of upsets," Lesea pointed out. "So, if you don't let them accumulate, you are better off."
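Two of the tactics Lesea describes, majority voting and redundancy in time, can be illustrated in software terms. This is a minimal sketch of the general ideas only: it is not the Xilinx conversion tool, real designs implement voters in FPGA logic rather than Python, and the names `majority_vote`, `run_twice_and_compare`, `task` and `reconfigure` are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise two-of-three majority of three redundant integer results.

    Each output bit takes the value held by at least two of the inputs,
    so an upset confined to one copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


def run_twice_and_compare(task, reconfigure, *args):
    """Redundancy in time: run a task, reload the configuration, run it
    again, and report whether the two results agree."""
    first = task(*args)
    reconfigure()            # hypothetical hook that reloads the configuration
    second = task(*args)
    return first, first == second


# Triple redundancy: one copy has a single flipped bit.
good = 0b10110010
upset = good ^ 0b00001000
assert majority_vote(good, good, upset) == good

# Redundancy in time with a stand-in task and a no-op reconfigure.
result, agreed = run_twice_and_compare(lambda x, y: x * y, lambda: None, 6, 7)
print(result, agreed)
```

The voter masks a fault in any single copy but not simultaneous faults in two copies, which is one reason periodic reconfiguration is still useful alongside it.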
Actel and iROC, on one side, and Xilinx, on the other, diverge on the implications of the data for the future. O'Neill and Dupont maintained that the problem of neutron upset will only get worse as geometries get finer. But Lesea begged to differ.
"Our measured data show that in moving from 150 nm to 130 nm, our susceptibility to upset improved by 15 percent," he said. "We are seeing another 15 percent improvement between 130 nm and 90 nm."
That's partly because as the area of the cross-coupled inverter latches (Lesea prefers not to call them SRAM cells) decreases, there is a smaller target for the secondary silicon ions to hit. But another key factor is that Xilinx designers are consistently making device and circuit changes to improve the cells' resistance. "One huge advantage we have is that our configuration latches don't have to be designed for speed," Lesea said. "We, unlike SRAM [memory] vendors, have the luxury of designing them for upset resistance."
Given this difference of views, the debate will certainly continue.



