MTBF is usually projected by just putting parts in an oven and letting them bake. No thermal cycling, vibration, ESD from service personnel, or other factors such as HIRF susceptibility are factored in. All it takes to leave everyone with a sour taste in the mouth is, for example, a customer who does not contract for a service guide and then attempts to service a complex system containing an FPGA that may be much more ESD sensitive than past products, and ends up with failed devices.
MTBF is also an interesting point, as what people are really interested in is the probability of success. Assuming a constant failure rate, a unit has only about a 37% chance of still working once the elapsed operating time equals the MTBF. That means if you want something to work for 10 years with a high probability of success, you will need a much larger MTBF, a redundant architecture, or both.
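To put numbers on that, here is a minimal sketch assuming the usual constant failure rate (exponential) model, R(t) = exp(-t / MTBF). The function names and the 95% target are my own choices for illustration, not something from the post.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of still working at time t, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def required_mtbf(t_hours: float, target_reliability: float) -> float:
    """MTBF needed to reach the target survival probability at time t."""
    return -t_hours / math.log(target_reliability)


if __name__ == "__main__":
    ten_years = 10 * HOURS_PER_YEAR
    # When the operating time equals the MTBF, R = exp(-1), about 0.37
    print(f"R(t = MTBF) = {reliability(ten_years, ten_years):.3f}")
    # MTBF needed for a 95% chance of surviving 10 years (assumed target)
    years = required_mtbf(ten_years, 0.95) / HOURS_PER_YEAR
    print(f"MTBF for 95% at 10 years = {years:.0f} years")
```

Under this model a 95% chance of surviving 10 years needs an MTBF of roughly 195 years, which is why the MTBF has to be so much larger than the mission time.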
While FPGAs have good FIT rates, the problem often comes in creating the power architecture: DC/DC converters and other POLs, especially hybrids, can have much worse FIT rates that swamp the FPGA contribution.
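A quick sketch of how the swamping shows up, assuming a simple series model where every part must work and the FIT rates just add (1 FIT = one failure per 10^9 device-hours). The part names and FIT values below are made up purely for illustration; real figures come from datasheets or a reliability prediction.

```python
from typing import Mapping

FIT_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours


def system_mtbf_hours(fit_rates: Mapping[str, float]) -> float:
    """Series system: failure rates add, so MTBF = 1e9 / total FIT."""
    return FIT_HOURS / sum(fit_rates.values())


def contribution(fit_rates: Mapping[str, float], part: str) -> float:
    """Fraction of the total failure rate due to one part."""
    return fit_rates[part] / sum(fit_rates.values())


if __name__ == "__main__":
    # Hypothetical values, not from any datasheet
    board = {"fpga": 20.0, "dcdc_hybrid": 300.0, "pol_regulators": 80.0}
    print(f"System MTBF: {system_mtbf_hours(board):.3g} hours")
    for name in board:
        print(f"{name}: {contribution(board, name):.0%} of failure rate")
```

With numbers like these the FPGA is a small slice of the total, so improving the FPGA FIT rate barely moves the system MTBF.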
Great blog. There is a lot to consider when looking at FPGA reliability, not just the actual FIT rate and MTBF of the device. Remember that the FIT rate only applies in the constant failure rate period of the bathtub curve.
You also need to consider the mounting method: BGA, column or land grid array, quad flat pack. Then there is the pin assignment, i.e. which pins are best to use if you have a choice.
SEUs are a concern and can lead to lock-up, but there is also the impact of total ionising dose, which can affect both the timing and the power dissipation.
With SEUs you need to be very careful with synthesis optimisations to ensure they do not introduce potential problems under SEU. Many companies and institutes are a little concerned about things like automatic state machine illegal state detection and prefer hand-coded solutions instead. It is possible to determine the MTBF between SEU events in user logic and configuration logic for Xilinx devices (I wrote an article on it), but manufacturers are very careful not to scare users with SEU, as there is a lot of bad advice out there.
In real high-end applications you will also be trying to ensure the junction temperature is derated correctly at your maximum qualification temperature to ensure reliability (think of Arrhenius).
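For anyone who has not met Arrhenius, here is a rough sketch of the acceleration factor between two junction temperatures, AF = exp((Ea / k) * (1/T_use - 1/T_stress)) with temperatures in kelvin. The 0.7 eV activation energy is only a commonly quoted default and is an assumption here; the right value depends on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c: float, t_stress_c: float,
                           activation_energy_ev: float = 0.7) -> float:
    """How much faster failures accrue at t_stress_c than at t_use_c.

    Temperatures are junction temperatures in degrees Celsius.
    The default activation energy is an assumed placeholder.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Example: compare a 55 C junction with a 125 C junction
    af = arrhenius_acceleration(55.0, 125.0)
    print(f"Acceleration factor 55 C -> 125 C: {af:.1f}")
```

The point for derating is the steepness of the curve: a few tens of degrees at the junction changes the predicted failure rate by a large factor.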
Of course, within the FPGA we can do TMR and error detection and correction, which can impact the speed of the device. You also need to consider the effects on single points of failure such as clocks, resets and inputs, which is why global TMR can be so useful.
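As a software illustration of the TMR idea (in a real design this is HDL, not Python), here is a bitwise 2-of-3 majority voter plus the textbook reliability formula for three independent copies with a perfect voter. Both functions are my own sketch, not taken from any vendor flow.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_reliability(r: float) -> float:
    """Probability that at least two of three independent copies work.

    Assumes each copy has reliability r and the voter never fails.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    good = 0b1010
    upset = good ^ 0b0100  # one copy suffers a single bit flip
    assert majority_vote(good, good, upset) == good
    print(f"TMR reliability for r = 0.90: {tmr_reliability(0.90):.3f}")
```

Note the perfect-voter assumption: a single voter, clock or reset shared by all three copies is itself a single point, which is exactly what triplicating those as well is meant to address.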
Also, if you are designing your FPGA to be reliable then the rest of the system needs to be reliable too, and you need to consider a lot more, so the cost goes up quickly.
Yes, SEUs can do very nasty things, which is why it is important to understand the probability of a bit flip. Designers of high-reliability equipment will use features such as triple modular redundancy (TMR) and Error Detection and Correction (EDAC). In addition, there are techniques for "scrubbing" the configuration.
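To show what EDAC means in practice, here is a small sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word carrying 4 data bits. I picked it only because it is the simplest example; it is not a claim about which code any particular FPGA or memory actually uses.

```python
from typing import List, Sequence, Tuple


def hamming74_encode(data: Sequence[int]) -> List[int]:
    """Encode 4 data bits into 7 bits laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: Sequence[int]) -> Tuple[List[int], int]:
    """Return (data bits, syndrome); syndrome is the 1-based flipped position or 0."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[4] ^= 1  # simulate a single event upset at position 5
    recovered, position = hamming74_correct(stored)
    print(f"recovered {recovered}, corrected position {position}")
```

A code like this only handles one flipped bit per word, which is one reason periodic scrubbing matters: it corrects upsets before a second one accumulates in the same word.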
I know that Xilinx has been very active for many years on mitigating SEUs. This includes design techniques that have resulted in the measurements on 28nm devices of SEUs/Mbit being the best ever going back as far as 250nm. Obviously, there is much more configuration memory and user memory (Block RAM) in the latest devices, but the numbers are real (not calculated), and users can build in Soft Error Mitigation (SEM) IP cores to attack the issue from the design side too.
Single Event Upsets (SEU) in SRAM-based FPGAs can have some unfortunate consequences. Not only can you change the logic function of a look-up table but since many of the configuration SRAM bits control the interconnect, a 'flipped' bit could add/subtract connections to an otherwise working device. Either of these errors could cause a failure that ripples thru the system and might end up creating a cascade effect where additional failures impact other devices (external memories, MCUs or data transmissions, PoL converters, etc). The mind reels...