What Do We Mean by FPGA Reliability?

Max The Magnificent
User Rank
Blogger
I know it's important, but...
Max The Magnificent   11/26/2013 2:20:04 PM
NO RATINGS
I know reliability is important. It's just that I find wading through things like MTBF and FIT and stuff boring ... I just want to solder things together and have them work (LOL)

DrFPGA
User Rank
Blogger
SEUs in SRAM-based FPGAs
DrFPGA   11/26/2013 2:28:58 PM
NO RATINGS
Single Event Upsets (SEUs) in SRAM-based FPGAs can have some unfortunate consequences. Not only can a flipped bit change the logic function of a look-up table, but since many of the configuration SRAM bits control the interconnect, it could also add or remove connections in an otherwise working device. Either of these errors could cause a failure that ripples through the system and might end up creating a cascade effect where additional failures impact other devices (external memories, MCUs, data transmissions, PoL converters, etc.). The mind reels...

paul.dillien
User Rank
Blogger
Re: SEUs in SRAM-based FPGAs
paul.dillien   11/26/2013 3:11:07 PM
NO RATINGS
Hi DrFPGA

Yes, SEUs can do very nasty things, which is why it is important to understand the probability of a bit flip.  Designers of high-reliability equipment will use features such as triple modular redundancy (TMR) and Error Detection and Correction (EDAC).  In addition, there are techniques for "scrubbing" the configuration.

I know that Xilinx has been very active for many years on mitigating SEUs.  This includes design techniques that have made the measured SEU rate per Mbit on 28nm devices the best ever, going back as far as 250nm.  Obviously, there is much more configuration memory and user memory (Block RAM) in the latest devices, but the numbers are real (measured, not calculated), and users can build in Soft Error Mitigation (SEM) IP cores to attack the issue from the design side too.
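
As a rough illustration of why a falling per-Mbit rate and a growing memory count pull in opposite directions, the device-level upset rate can be approximated as the per-Mbit rate multiplied by the number of Mbits. This is only a sketch: it assumes upsets scale linearly with memory size, and the figures are made-up placeholders, not Xilinx data.

```python
# One FIT is one failure (here: one upset) per 1e9 device-hours.
HOURS_PER_FIT_UNIT = 1e9


def device_upset_fit(fit_per_mbit: float, mbits: float) -> float:
    """Device-level upset rate in FIT, assuming linear scaling with memory size."""
    return fit_per_mbit * mbits


def mean_hours_between_upsets(fit: float) -> float:
    """Average hours between upsets for a single device at the given FIT."""
    return HOURS_PER_FIT_UNIT / fit


# Hypothetical generations: the per-Mbit rate improves 4x, the memory grows 10x.
older = device_upset_fit(fit_per_mbit=400.0, mbits=10.0)
newer = device_upset_fit(fit_per_mbit=100.0, mbits=100.0)

print("older device FIT:", older)
print("newer device FIT:", newer)
print("newer device, hours between upsets:", mean_hours_between_upsets(newer))
```

With these placeholder values the newer device still sees more upsets overall despite the better per-Mbit figure, which is one reason design-side mitigation remains relevant.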

Adam-Taylor
User Rank
Blogger
Lots to consider
Adam-Taylor   11/26/2013 6:05:01 PM
NO RATINGS
Paul, 

Great blog. There is lots to consider when looking at FPGA reliability, not just the actual FIT rate and MTBF of the device. Remember, the FIT rate only applies in the constant-failure-rate period of the bathtub curve.

You also need to consider the mounting method (BGA, column or land grid array, quad flat pack). Then there is the pin assignment: which pins are best to use, if you have a choice.

SEUs are a concern, as they can lead to lock-up, but there are also the impacts of total ionising dose, which can affect both the timing and the power dissipation.

With SEUs you need to be very careful that synthesis optimisations do not introduce potential problems under SEU. Many companies and institutes are a little concerned about things like automatic state machine illegal-state detection and prefer hand-coded solutions instead. It is possible to determine the MTBF between SEU events in user logic and configuration logic for Xilinx devices (I wrote an article on it), but manufacturers are very careful not to scare users with SEUs, as there is a lot of bad advice out there.

In real high-end applications you are also going to be trying to ensure the junction temperature is derated correctly at your maximum qualification temperature to ensure reliability (think of Arrhenius).

Of course, within the FPGA we can do TMR and error detection and correction, which can impact the speed of the device. You also need to consider the effects on single points such as clocks, resets and inputs, which is why global TMR can be so useful.

Also, if you are designing your FPGA to be reliable then the rest of the system needs to be too, and there is a lot more to consider, so the cost goes up quickly.
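
To make the TMR point concrete, here is a minimal software model of a 2-of-3 majority voter. It is a behavioural sketch only, not synthesizable HDL, and the helper names are invented for illustration. It shows a single upset in one replica being masked, and also why a fault on a shared single point (such as a common input) gets through a voter that only triplicates the logic.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model an SEU by inverting one bit of a register value."""
    return word ^ (1 << bit)


good = 0b10110010

# Case 1: one of the three replicas takes an upset; the voter masks it.
upset_replica = flip_bit(good, 5)
assert majority_vote(good, upset_replica, good) == good

# Case 2: the upset hits a shared input feeding all three replicas,
# so every replica agrees on the wrong value and the voter passes it on.
bad_input = flip_bit(good, 5)
assert majority_vote(bad_input, bad_input, bad_input) != good

print("single-replica upset masked; shared-input upset not masked")
```

The second case is the motivation for triplicating clocks, resets and inputs as well as the logic itself.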

Adam-Taylor
User Rank
Blogger
MTBF and FIT
Adam-Taylor   11/26/2013 6:14:48 PM
NO RATINGS
The MTBF is also an interesting point, as what people are really interested in is the probability of success. At the point where the elapsed operating time equals the MTBF, a unit has only a 37% chance of still working (assuming a constant failure rate). This means that if you want something to work for 10 years with a high probability of success, you will need a much larger MTBF, a redundant architecture, or both.
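
The 37% figure comes from the exponential reliability model R(t) = exp(-t / MTBF), which holds only in the constant-failure-rate region. The sketch below works through it. The 95% target, the 8,760-hour year and the simple two-unit redundancy formula (independent failures, perfect switchover) are assumptions chosen for illustration.

```python
import math

HOURS_PER_YEAR = 8760.0  # ignores leap days


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of surviving t_hours under a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def required_mtbf(t_hours: float, target: float) -> float:
    """MTBF needed to reach the target survival probability at t_hours."""
    return -t_hours / math.log(target)


def redundant_pair(r: float) -> float:
    """Survival probability of two independent units where either one suffices."""
    return 1.0 - (1.0 - r) ** 2


mission = 10 * HOURS_PER_YEAR

print("R at t = MTBF:", reliability(mission, mission))
print("MTBF for 95% over 10 years (hours):", required_mtbf(mission, 0.95))
print("two units, each with MTBF = mission time:",
      redundant_pair(reliability(mission, mission)))
```

For a 95% target the required MTBF works out to about 19.5 times the mission time, which is why redundancy is often the more practical route.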

While FPGAs have good FIT rates, the problem often comes in creating the power architecture, as DC/DC converters and other PoLs, especially hybrids, can have much worse FIT rates that swamp the FPGA contribution.

MS243
User Rank
Manager
Modern FPGA'S AND MTBF
MS243   11/27/2013 7:10:10 AM
NO RATINGS
MTBF is usually projected by just putting parts in an oven and letting them bake.   There is no thermal cycling, vibration, ESD due to service personnel, or other factors such as HIRF susceptibility, etc., factored in.    All it takes for a failure, for example, is a customer not contracting for a service guide and then attempting to service a complex system with an FPGA that may be much more ESD-sensitive than past products, giving everyone a sour taste in the mouth with failed devices.

paul.dillien
User Rank
Blogger
Re: Modern FPGA'S AND MTBF
paul.dillien   11/27/2013 9:26:46 AM
NO RATINGS
That's the benefit of the vendor Reliability Reports.  The vendors perform tests such as ESD testing, High Temperature Operating Life (HTOL), autoclave and cycling, but also tests that (most) customers cannot replicate such as bond strength testing or total ionising dose.

Another related topic is where customers choose to "up-screen" by purchasing, say, industrial grade devices and testing them to a higher standard. 

MS243
User Rank
Manager
Re: Modern FPGA'S AND MTBF
MS243   11/27/2013 1:51:06 PM
NO RATINGS
Some of the worst devices I've ever had the misfortune to be associated with had a full Mil design and test. Put them in an industry-standard programmer and 50% of the parts would fail on the first try. This compared with no fall-out whatsoever from another vendor's re-packaged commercial-grade parts. The key difference was that the second vendor implemented full EDAC on every storage location; performance was greatly enhanced by conservative design rather than by trying to screen quality in after the design process. The second vendor's parts would tolerate two bad bits per byte of code or data stored and still function perfectly at the device level. We did not achieve similar results with the first vendor's parts.

One small item in a datasheet or test report can make or break a part. One potential issue with your presentation of these FIT numbers is that, for a non-flash part, the Xilinx and Altera parts also need to factor in the FIT of the separate memory IC and the additional solder connections, decoupling capacitors, strapping resistors, etc. The reports only present the FIT of the FPGA, and one must also add in the FIT for all the other parts required on the circuit board to get a FIT for a given solution, as Adam partially alludes to. (The power supply will likely have a FIT many times greater than the FPGA, for example, so the fewer rails one has, the better the FIT one can achieve, given FPGA FIT rates in the same order of magnitude.)
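
A quick sketch of the solution-level roll-up described here: for parts in series (any one failing takes the function down) and constant failure rates, the FIT values simply add. Every number below is a made-up placeholder, not a figure from any vendor report.

```python
def system_fit(parts: dict) -> float:
    """Total FIT of a series system: the sum of the individual part FITs."""
    return sum(parts.values())


def mtbf_hours(fit: float) -> float:
    """Convert FIT (failures per 1e9 hours) to MTBF in hours."""
    return 1e9 / fit


# Hypothetical bill of materials for an SRAM-based FPGA solution.
bom = {
    "FPGA": 20.0,
    "configuration memory IC": 15.0,
    "PoL converters (4 rails)": 4 * 50.0,
    "solder joints and passives": 25.0,
}

total = system_fit(bom)
print("solution FIT:", total)
print("solution MTBF (hours):", mtbf_hours(total))
print("FPGA share of total FIT:", bom["FPGA"] / total)
```

Dropping one rail in this toy example removes more FIT than the FPGA contributes in the first place, which is the point about minimising the number of supplies.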

AZskibum
User Rank
CEO
Re: Modern FPGA'S AND MTBF
AZskibum   11/30/2013 7:56:21 AM
NO RATINGS
Thanks for the interesting discussion about FIT rates, SEUs and reliability, which apply of course not only to FPGAs. There has recently been a lot of renewed interest in the probabilities of bit flips in SRAM locations for safety critical systems, so this is a very timely discussion.

paul.dillien
User Rank
Blogger
Re: Modern FPGA'S AND MTBF
paul.dillien   11/30/2013 3:13:39 PM
NO RATINGS
Hi MS243

Ah yes, power supplies.  I've heard of FPGA designs dissipating over 40 watts, which with core supply voltages of 1 volt or less translates to significant supply currents.  (Think also "voltage droop" and "electromigration"...). 

KB3001
User Rank
CEO
Testability...
KB3001   11/29/2013 3:02:28 PM
NO RATINGS
Interesting article. Testability is increasingly crucial in modern electronic chips/platforms.

Kinnar
User Rank
CEO
Electronics Devices are becoming more complex systems
Kinnar   11/30/2013 6:48:00 AM
NO RATINGS
As technology evolves, electronic systems are becoming more and more complex, both in hardware design and in software design. As discussed very well in the article, failures can occur due to design, manufacturing, or usage flaws, so new testing methods will be required to classify and evaluate them.

_hm
User Rank
CEO
Reliability - An intricate science
_hm   11/30/2013 7:21:57 AM
NO RATINGS
Reliability is an intricate and wonderful science. And it involves many disciplines, like physics/semiconductor, mechanical, electronics, materials and many others.

The best way to get familiar with the basic concepts is to read MIL-HDBK-217F. Also, how these concepts apply depends on the application. Parts can be used in space (in a controlled environment or open to space), military (navy, air at high altitude, land from Siberia to Sahara), automobile, medical, etc.

The best way to have a new product with high reliability is to involve reliability engineers from the start of the project.

 

_hm
User Rank
CEO
Re: Reliability - An intricate science
_hm   11/30/2013 7:25:46 AM
NO RATINGS
PS: One basic concept to learn is the Arrhenius equation and activation energy. Electromigration and decap inspection of the die are very interesting topics, and after understanding them you will design much more reliable products.
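
For anyone who wants to play with the Arrhenius relationship, here is a small sketch of the temperature acceleration factor, AF = exp((Ea / k) * (1/T_use - 1/T_stress)) with temperatures in kelvin. The 0.7 eV activation energy and the two junction temperatures are illustrative assumptions only; real values depend on the failure mechanism and the vendor's data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Arrhenius acceleration of a stress temperature relative to a use temperature.

    Temperatures are junction temperatures in degrees Celsius.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k))


# Assumed example: 0.7 eV mechanism, 55 C in the field, 125 C in the oven.
af = acceleration_factor(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
print("acceleration factor:", af)
print("field hours represented by 1000 oven hours:", 1000.0 * af)
```

The same function run the other way shows how much life is gained by derating the junction temperature.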

 

paul.dillien
User Rank
Blogger
Re: Reliability - An intricate science
paul.dillien   11/30/2013 3:07:41 PM
NO RATINGS
I agree that it's a great idea to involve reliability guys up front.  (Reliability should be designed into a product, not tested afterwards.)

What started me pondering the topic was considering if FPGAs were significantly different from other semiconductors.  In many ways they are the same, but they have additional dimensions, such as vendor-supplied software for customers to incorporate their own designs.

_hm
User Rank
CEO
Re: Reliability - An intricate science
_hm   11/30/2013 7:06:02 PM
NO RATINGS
@paul: An FPGA should be treated like any other VLSI or similar chip. The basic process of construction remains similar to other programmable VLSI logic.

A single-bit error can be due to many causes. It can be EMI, EMP, nuclear radiation or something else. But that is another dimension of reliability. Similarly, software quality and FPGA program reliability are a totally different branch.

   
