Reliability is a multi-dimensional problem that can have some very unfortunate (possibly career-limiting) consequences when things go wrong.
My previous two columns have focused on Microsemi's SmartFusion2 and Igloo2 FPGA families. Today I ask the question: What do we mean by FPGA reliability?
Recently, I was listening to a Microsemi webinar, and I was struck by a remark that its FPGAs were "more reliable" than others. That started me wondering whether there are reasons FPGA reliability might differ from that of other semiconductor devices.
Before we proceed, let me clarify that, even though I have an appreciation of semiconductor reliability issues, I am no expert in the topic. Therefore, the following discussion reflects my perspective as a marketing person.
The first step is to decide what we mean by semiconductor reliability. Wikipedia says, "The finished product quality depends upon the many layered relationship of each interacting substance in the semiconductor." Most hardware engineers would probably simply expect the FPGA not to fail in their system. Even this simple definition is not obvious, as I will explain by an analogy. If I get into my car and it will not start, I call that a failure. However, if I were driving and ran out of fuel because the fuel gauge was wrong, would that be a failure or a calibration problem?
Many semiconductor companies periodically issue a reliability report that details their work in this area. I decided to compare the reliability of Microsemi FPGAs with those of its major competitors, Altera and Xilinx. Microsemi already lists its SmartFusion2 family in its reliability report under 65nm Flash technology. (The Igloo2 family is not included because it had not been released when the report was compiled.)
The report shows the outcome of a wide range of tests to measure parameters such as device failure rates, nonvolatile data retention, ESD robustness, and the mechanical integrity of the packaging. These tests apply to most semiconductor devices. FPGA vendors typically take a sample batch from production and subject the devices to many hours of being powered up and clocking data while in an oven at the maximum junction temperature. Periodically, the devices are tested and then returned to the oven to provide cumulative data points. This procedure, called accelerated life testing, aims to confirm that the devices will operate satisfactorily and stay within specification over a number of years when operated in a more typical environment -- say, at 55°C.
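The extrapolation from oven temperature to a typical operating temperature is usually done with an Arrhenius model: stress hours are multiplied by an acceleration factor that depends on an assumed activation energy for the failure mechanism. As a rough sketch (the 0.7 eV activation energy and the 125°C stress temperature below are illustrative assumptions, not figures from any vendor's report):

```python
import math

BOLTZMANN_EV = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_af(t_use_c: float, t_stress_c: float, ea_ev: float = 0.7) -> float:
    """Acceleration factor between a stress temperature and a use temperature.

    AF = exp( (Ea/k) * (1/T_use - 1/T_stress) ), temperatures in kelvin.
    ea_ev is the assumed activation energy of the failure mechanism (eV).
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))


# Example: burn-in at 125°C, extrapolated to a 55°C operating environment.
af = arrhenius_af(t_use_c=55.0, t_stress_c=125.0)
print(f"Acceleration factor: {af:.1f}")
# With these assumed numbers, each oven hour counts for dozens of field hours,
# which is how a few thousand stress hours can stand in for years of service.
```

The acceleration factor is highly sensitive to the assumed activation energy, which is one reason vendors publish the assumptions alongside the results.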
Obviously, customers cannot wait 10 years for confirmation that the devices are good for that period of time, so the elevated temperature is designed to highlight weaknesses quickly. Vendors accumulate device hours across many parts to give the failure-rate calculation statistical weight. Returning to my analogy, if my car started with the first turn of the key every day this week, would this guarantee that it will start tomorrow? Of course not, but it would give me more confidence than if it laboriously spluttered to a start each time. Reliability engineers calculate a mean time to failure (MTTF) at a certain statistical confidence level. The related parameter, the failures-in-time (FIT) rate, gives the number of failures that can be expected in a billion (10⁹) device hours of operation. From a mathematical perspective, more device hours give more credibility to the assumptions behind the FIT calculations.
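For the common case where a life test ends with zero failures, the FIT estimate at a given confidence level reduces to a one-line formula. A minimal sketch (the 60% confidence level and the device-hour total are illustrative assumptions; vendors state their own choices in their reports):

```python
import math


def fit_zero_failures(device_hours: float, confidence: float = 0.60) -> float:
    """Upper-bound FIT estimate when a life test records zero failures.

    For zero failures, the chi-squared upper bound simplifies to
    -ln(1 - confidence) expected failures, scaled to 1e9 device hours.
    """
    expected_failures = -math.log(1.0 - confidence)
    return expected_failures * 1e9 / device_hours


# Example: 10 million equivalent device hours with no failures, 60% confidence.
fit = fit_zero_failures(device_hours=1e7, confidence=0.60)
print(f"FIT: {fit:.1f}")  # roughly 92 FIT

# Ten times the device hours drives the bound down by a factor of ten,
# which is why mature products can quote much lower headline FIT rates.
```

This is why a newly released family with few accumulated hours can show a higher calculated FIT rate than an older product, even when neither has actually failed in testing.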
The SmartFusion2 test results showed no failures over a (relatively) small number of device hours. This is not surprising, but it does not mean that the product will never fail, of course. The initial FIT rate in the report is calculated as 24.51. In contrast, FPGA products that Xilinx released several years ago using a 65nm SRAM process have stacked up considerably more device hours. Xilinx shows a headline FIT rate of 10, which is broadly comparable to the Altera results.