Reliability is a multi-dimensional problem that can have some very unfortunate (possibly career-limiting) consequences when things go wrong.
FPGAs differ in some significant ways from other semiconductor products. For a start, they are shipped as blanks, and the customer's design is configured into them. The Microsemi devices store their configuration in Flash, while Xilinx and Altera devices employ SRAM-based storage. Here again, Microsemi claims a number of advantages, including immunity of the Flash configuration elements to single-event upsets (SEUs). An SEU occurs when an energetic subatomic particle strikes the device and flips a latch holding a configuration bit. This topic could fill a blog in itself, but suffice it to say that the effect can be detected and measured, and conventional wisdom suggests it gets worse at smaller process geometries. That is not necessarily correct, but again, we must push on.
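One common defense against configuration upsets in SRAM-based parts is periodic "scrubbing": reading back the configuration, checking it against a known-good copy, and rewriting any corrupted frame. The sketch below simulates this with plain arrays and a simple additive checksum; the frame size, frame count, and checksum are illustrative assumptions, not any vendor's actual readback interface (real devices use dedicated readback ports and CRC/ECC hardware).

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define FRAME_WORDS 8   /* hypothetical words per configuration frame */
#define NUM_FRAMES  4   /* hypothetical frame count */

/* Simulated configuration memory and a golden copy held in
 * (notionally SEU-immune) non-volatile storage. */
static uint32_t config_mem[NUM_FRAMES][FRAME_WORDS];
static uint32_t golden_mem[NUM_FRAMES][FRAME_WORDS];

/* Simple additive checksum per frame; real devices use CRC or ECC. */
static uint32_t frame_checksum(const uint32_t *frame)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < FRAME_WORDS; i++)
        sum += frame[i];
    return sum;
}

/* One scrub pass: compare each frame's checksum against the golden
 * copy and rewrite any frame that has been corrupted. Returns the
 * number of frames repaired. */
int scrub_configuration(void)
{
    int repaired = 0;
    for (size_t f = 0; f < NUM_FRAMES; f++) {
        if (frame_checksum(config_mem[f]) != frame_checksum(golden_mem[f])) {
            memcpy(config_mem[f], golden_mem[f], sizeof(config_mem[f]));
            repaired++;
        }
    }
    return repaired;
}
```

The key design point is that the scrub interval bounds how long a flipped configuration bit can influence the logic before it is repaired.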
What is true is that SEUs can affect the memory used by the application in FPGAs. This memory might hold data, processor executable code, or coefficients, and any error could cause the system to malfunction. Is this a failure? I would class it as unreliable operation. The Microsemi reliability report contains no SEU data for the memory fabricated in SmartFusion2; Xilinx reports sensitivities for both configuration and application memory. At this point, it is worth mentioning that standalone SRAM devices are also susceptible to SEUs, and companies such as Cypress, Atmel, and Honeywell produce radiation-hardened devices specifically for aerospace applications.
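For application data that must survive SEUs, a classic mitigation is triple modular redundancy (TMR): store three copies and take a bitwise majority vote on every read, so any single flipped bit is out-voted by the two intact copies. This is a minimal sketch of the idea (the struct and function names are my own, not from any vendor library):

```c
#include <stdint.h>

/* Three independent copies of one value, stored in SEU-prone RAM. */
typedef struct {
    uint32_t copy[3];
} tmr_word_t;

void tmr_write(tmr_word_t *w, uint32_t value)
{
    w->copy[0] = w->copy[1] = w->copy[2] = value;
}

/* Bitwise majority vote: each output bit takes the value held by at
 * least two of the three copies, so a single upset is masked. The
 * voted result is written back, repairing the corrupted copy before
 * a second upset can accumulate in the same word. */
uint32_t tmr_read(tmr_word_t *w)
{
    uint32_t a = w->copy[0], b = w->copy[1], c = w->copy[2];
    uint32_t voted = (a & b) | (a & c) | (b & c);
    w->copy[0] = w->copy[1] = w->copy[2] = voted;  /* scrub on read */
    return voted;
}
```

TMR triples the memory footprint, which is why production designs often prefer ECC (SEC-DED) where the memory controller supports it; the voting idea, though, is the same one used to harden registers in the fabric.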
Another key difference among semiconductor products relates to the design software. The algorithms that place, route, and then verify the design are bespoke to the device architecture. Vendors generate parameters, such as timing models, based on their characterization work. These models are typically tuned as the product matures from early samples toward production release, but they can also be adjusted in response to mistakes and errors reported by the user base. Xilinx and Altera claim more than 15,000 customers, so problems can be caught quickly. If, for example, the software models contained a timing issue that stopped a small percentage of devices from functioning at maximum temperature and minimum supply voltage, the likelihood of identifying it is obviously higher with thousands of customers filling FPGAs with many different designs.
The SmartFusion2 and competing products also contain ARM processors. These are hard implementations, fabricated as optimized cores on the silicon, so they can be carefully characterized. The processors benefit from the extensive ecosystem and tool support built up by ARM, but how many software engineers have ever written entirely bug-free code?
The ARM processors on the Microsemi chips clock much slower (at 166MHz) than the latest 28nm products from Xilinx (which go up to 1GHz). However, designers have plenty of opportunity at any clock speed to foul up their design. For example, there will be interfaces between the ARM and the programmable fabric deep inside any design. Data will probably be buffered in a FIFO before being loaded onto the bus, and that means that somewhere there must be the equivalent of a driver to control the data transfer. The design must consider and evaluate which tools support this transfer and how the transfer behaves if an interrupt occurs mid-transaction.
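The interrupt-during-transfer hazard is exactly what a well-structured FIFO driver is designed to tolerate. Below is a minimal single-producer/single-consumer ring-buffer sketch of the kind such a driver might use, where the ISR pushes words arriving from the fabric and the main loop pops them; it is a generic illustration, not any vendor's actual driver API. Because each index has exactly one writer, an interrupt arriving mid-pop cannot corrupt the queue.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_SIZE 16u   /* power of two, so indices wrap with a mask */

typedef struct {
    uint32_t          buf[FIFO_SIZE];
    volatile uint32_t head;  /* written only by the producer (ISR) */
    volatile uint32_t tail;  /* written only by the consumer (main loop) */
} fifo_t;

/* Called from the interrupt handler when the fabric raises a
 * data-ready interrupt. Returns false when full, so the driver can
 * count overruns instead of silently losing data. */
bool fifo_push(fifo_t *f, uint32_t word)
{
    uint32_t next = (f->head + 1u) & (FIFO_SIZE - 1u);
    if (next == f->tail)
        return false;            /* full: one slot is kept empty */
    f->buf[f->head] = word;
    f->head = next;              /* publish only after data is written */
    return true;
}

/* Called from the main loop. Safe against the ISR because the
 * consumer never writes head and the producer never writes tail. */
bool fifo_pop(fifo_t *f, uint32_t *word)
{
    if (f->tail == f->head)
        return false;            /* empty */
    *word = f->buf[f->tail];
    f->tail = (f->tail + 1u) & (FIFO_SIZE - 1u);
    return true;
}
```

Note the ordering inside `fifo_push`: the data word is written before the head index is advanced, so the consumer can never observe a published slot that has not yet been filled. (On a multi-core or out-of-order system, `volatile` alone is not enough and memory barriers would be required; on a single-core microcontroller with one ISR producer, this pattern is the common idiom.)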
These are a few of the reasons the comment from Microsemi started me thinking about reliability. Do you have any reliability horror stories to share?