@paul: An FPGA should be treated like any other VLSI or similar chip. The basic construction process is much the same as for other programmable VLSI logic.
Single-bit errors can have many causes: EMI, EMP, nuclear radiation, and others. But that is another dimension of reliability. Similarly, software quality and FPGA program reliability are a totally different branch.
Ah yes, power supplies. I've heard of FPGA designs dissipating over 40 watts, which with core supply voltages of 1 volt or less translates to significant supply currents. (Think also "voltage droop" and "electromigration"...).
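To put a number on that: a minimal sketch of the arithmetic, using the 40 W and 1 V figures from the comment above. The 1 milliohm distribution resistance is purely an assumed value for illustration, not a figure from any datasheet.

```python
def supply_current(power_w: float, core_voltage_v: float) -> float:
    """Current drawn from the core rail, I = P / V."""
    return power_w / core_voltage_v


def ir_drop(current_a: float, resistance_ohm: float) -> float:
    """Static voltage drop across the power distribution path, V = I * R."""
    return current_a * resistance_ohm


# Figures from the discussion: 40 W dissipated at a 1 V core supply.
core_current = supply_current(40.0, 1.0)

# ASSUMPTION: 1 milliohm of total path resistance (plane, vias, package).
droop = ir_drop(core_current, 0.001)

print(f"Core current: {core_current:.1f} A")
print(f"Static IR drop: {droop * 1000:.0f} mV")
```

Even a milliohm of path resistance eats a noticeable slice of a 1 V rail at these currents, which is why droop and electromigration come up in the same breath.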
I agree that it's a great idea to involve reliability guys up front. (Reliability should be designed into a product - not tested afterwards).
What started me pondering the topic was considering if FPGAs were significantly different from other semiconductors. In many ways they are the same, but they have additional dimensions, such as vendor-supplied software for customers to incorporate their own designs.
Thanks for the interesting discussion about FIT rates, SEUs and reliability, which apply of course not only to FPGAs. There has recently been a lot of renewed interest in the probabilities of bit flips in SRAM locations for safety critical systems, so this is a very timely discussion.
PS: One basic concept to learn is the Arrhenius equation and activation energy. Electromigration and decap inspection of the die are very interesting topics, and once you understand them you will design much more reliable products.
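For anyone who has not met the Arrhenius model before, here is a rough sketch of how it is commonly used to relate a high-temperature stress test to normal operating conditions. The activation energy of 0.7 eV and the two temperatures are assumed example values only; the real activation energy depends on the failure mechanism and should come from the vendor's reliability data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Acceleration factor AF = exp((Ea / k) * (1/T_use - 1/T_stress)).

    Temperatures are given in degrees Celsius and converted to kelvin.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# ASSUMPTIONS: Ea = 0.7 eV, 55 C use temperature, 125 C stress temperature.
af = arrhenius_acceleration(0.7, 55.0, 125.0)
print(f"Acceleration factor: {af:.1f}")
```

An acceleration factor of AF means each hour at the stress temperature stands in for roughly AF hours at the use temperature, for that one failure mechanism.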
Reliability is an intricate and wonderful science. And it involves many disciplines, like physics/semiconductors, mechanical, electronics, materials and many others.
The best way to get familiar with the basic concepts is to read MIL-HDBK-217F. How relevant each concept is also depends on the application. Parts can be used in space (in a controlled environment or open to space), in the military (navy, air at high altitude, land from Siberia to the Sahara), or in automotive, medical, etc.
The best way to get a new product with high reliability is to involve reliability engineers from the start of the project.
As technology evolves, electronic systems are becoming more and more complex, both in hardware design and in software design. As the article discusses very well, failures can occur due to design, manufacturing or usage flaws, and it will require new testing methods to classify and evaluate those failures.
Some of the worst devices I've ever had the misfortune to be associated with had a full Mil design and test. Put them in an industry-standard programmer and 50% of the parts would fail on the first try. Compare that with no fallout whatsoever from another vendor's re-packaged commercial-grade parts. The key difference was that the second vendor implemented full EDAC on every storage location. Performance was greatly enhanced by conservative design rather than by trying to screen quality in after the design process. The second vendor's parts would tolerate two bad bits per byte of code or data stored and still function perfectly at the device level. We did not achieve similar results with the first vendor's parts. One small item in a datasheet or test report can make or break a part. One potential issue with your presentation of these FIT numbers is that, for a non-flash part, the Xilinx and Altera parts also need to factor in the FIT of the separate memory IC and the additional solder connections, decoupling capacitors, strapping resistors, etc. The reports only present the FIT of the FPGA, and one must also add in the FIT for all the other parts required on the circuit board to get a FIT for a given solution, as Adam partially alludes to. (The power supply will likely have a FIT many times greater than the FPGA, for example, so the fewer rails one has, the better the FIT one can achieve, given FPGA FIT rates in the same order of magnitude.)
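As a rough illustration of the point about adding up FIT for the whole solution: under the usual series-system assumption (any single part failing takes the board down, constant failure rates), the component FITs simply sum, and MTBF in hours is 1e9 divided by the total. Every FIT value below is a made-up placeholder, not a number from any vendor report.

```python
def total_fit(component_fits: dict[str, float]) -> float:
    """Series-system FIT: sum of the individual component FITs."""
    return sum(component_fits.values())


def mtbf_hours(fit: float) -> float:
    """MTBF in hours, given FIT (failures per 1e9 device-hours)."""
    return 1e9 / fit


# HYPOTHETICAL placeholder values, for illustration only.
sram_fpga_solution = {
    "fpga": 20.0,
    "config_memory": 10.0,
    "solder_joints_and_passives": 5.0,
    "power_supply_rails": 150.0,
}

fit = total_fit(sram_fpga_solution)
print(f"Solution FIT: {fit:.0f}")
print(f"MTBF: {mtbf_hours(fit):,.0f} hours")
```

With placeholder numbers like these the power supply dominates the total, which is the commenter's point: comparing FPGA FIT figures alone says little about the board.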
That's the benefit of the vendor Reliability Reports. The vendors perform tests such as ESD testing, High Temperature Operating Life (HTOL), autoclave and cycling, but also tests that (most) customers cannot replicate such as bond strength testing or total ionising dose.
Another related topic is where customers choose to "up-screen" by purchasing, say, industrial grade devices and testing them to a higher standard.