Over the last year or so I’ve become very interested in the topic of creating radiation-tolerant designs, so I am really excited by today’s announcement by Xilinx of their new space-grade rad-hard Virtex-5QV FPGAs.
What Xilinx have done is really rather cunning, because… actually, I have so many ideas bouncing around in my head that I don’t know where to start. Perhaps we should begin by discussing the various nasty things radiation can do to a silicon chip (well, any sort of integrated circuit really, but let’s focus on silicon).
First of all we have something called a single event upset (SEU). This is where a radiation event (in the form of an ionizing particle or photon) hits the chip and transfers sufficient energy to cause a register bit or a memory bit to flip into its opposite logical state (a 0 flips to a 1, or vice versa). We also have a single event transient (SET), which is where a radiation event causes a logic gate or buffer to exhibit a positive- or negative-going pulse at its output. If such an SET is subsequently loaded into a register or a memory element, then at that point it becomes an SEU.
There’s also something called “latchup,” which refers to a particular type of short circuit in the form of a low-impedance path between the power supply rails of a CMOS circuit. A common cause of latchups is ionizing radiation, which makes this a significant issue in electronic products designed for aerospace applications. If a latchup condition persists too long, the overcurrent can damage the chip; the only way to clear such a condition is to power down the chip and power it up again.
And one more thing to consider is the total ionizing dose (TID), also known as the “absorbed dose,” which is a measure of the energy deposited in a medium by ionizing radiation per unit mass. I’m a bit “fluffy” about this one, but my understanding is that prolonged exposure to radiation physically degrades the chip until – at some point – it ceases to function.
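To put some numbers on “energy per unit mass,” here is a minimal sketch of the arithmetic. The only facts it relies on are the standard unit definitions (1 gray = 1 joule per kilogram = 100 rad); the function names are my own and purely illustrative.

```python
RAD_PER_GRAY = 100.0  # 1 Gy = 1 J/kg = 100 rad (standard unit definitions)


def absorbed_dose_rad(energy_joules: float, mass_kg: float) -> float:
    """Absorbed dose = deposited energy / mass, expressed in rad."""
    dose_gray = energy_joules / mass_kg
    return dose_gray * RAD_PER_GRAY


def rad_to_gray(dose_rad: float) -> float:
    """Convert a dose in rad to gray (J/kg)."""
    return dose_rad / RAD_PER_GRAY


# 2 J deposited in 4 kg of material is 0.5 Gy, i.e. 50 rad.
example_dose = absorbed_dose_rad(2.0, 4.0)

# A 1 Mrad total dose corresponds to 10,000 Gy.
one_mrad_in_gray = rad_to_gray(1e6)
```

In other words, a part that survives a 1 Mrad total dose has absorbed 10,000 joules for every kilogram of its material over its lifetime.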
A really bad analogy
As the geometries of silicon chips get smaller and smaller, they become more susceptible to the effects of radiation. Let’s visualize a logic gate implemented in a 5 micron technology as being equivalent (in terms of size) to, say, a full-size car like a Volkswagen Beetle. Let’s also visualize an ionizing heavy ion as being the equivalent of a tennis ball. In this case, throwing the tennis ball at the car is not likely to cause any serious effects (at least, it won’t at any velocity I can achieve). By comparison, consider a logic gate implemented at the 28 nanometer node, which would be roughly equivalent to a toy car of the Matchbox or Hot Wheels variety. In this case, our hurled tennis ball could cause some serious damage. (This is probably a dreadfully inaccurate analogy … but it’s the best I can come up with on the spur of the moment.)
Rad-hard versus rad-tolerant
So what’s the difference between rad-hard and rad-tolerant? Actually, this is one of those “Beware, here be dragons” topics, because these mean different things to different people.
One definition of a rad-hard device is one that has special physical characteristics that make it immune to the effects of radiation; for example, devices fabricated with lightly doped epitaxial layers grown on heavily doped substrates are less susceptible to latchup. Another definition of a rad-hard device is one that is guaranteed to meet specific radiation characteristics.
Now, some folks would say that a rad-tolerant chip is one that exhibits some capabilities of withstanding the effects of radiation environments. In this case they are typically talking about the physical aspects of the chip. Personally I equate this to the phrase “stretch-resistant socks”; we all know that they are going to stretch – the best we can hope is that they resist it for a while.
My own personal feeling is that the term “rad-hard” should be applied to (a) anything we do to the underlying physical construction of a chip that makes it less susceptible to the effects of radiation and (b) a chip that is guaranteed to meet specific radiation characteristics. Meanwhile, I feel that the term “rad-tolerant” should be used in the context of designs that are created in such a way as to mitigate the effects of any radiation events that do occur. (I’m more than open to discussion on this point and will happily change my mind in the face of an appropriately compelling argument.)
Register bits, memory cells, and configuration cells
All digital silicon chips – including ASICs and FPGAs of all varieties – are susceptible to SETs in their logic gates and SEUs in their register bits and memory elements. There are a number of ways to handle these effects.
In the case of blocks of memory, for example, we can use additional bits to implement ECC (error correcting codes). We can also implement automatic memory scrubbing circuitry that works away in the background cycling through the memory reading it word by word and using its ECC to detect, correct, and re-write any corrupted data.
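To make the ECC-plus-scrubbing idea a little more concrete, here is a minimal software sketch. It assumes a tiny Hamming(7,4) code (4 data bits protected by 3 check bits), which is my choice for illustration only; real memories typically use wider codes, and all of the function names here are hypothetical. A code like this corrects a single flipped bit per word, but two flips in the same word would be mis-corrected, which is exactly why the scrubber needs to visit each word before a second upset accumulates.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # check bit covering positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # check bit covering positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # check bit covering positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # positions 1..7
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_correct(codeword: int) -> tuple:
    """Return (corrected codeword, True if a single-bit error was fixed)."""
    syndrome = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid word
    if syndrome:
        codeword ^= 1 << (syndrome - 1)  # the syndrome names the bad position
    return codeword, bool(syndrome)


def hamming74_data(codeword: int) -> int:
    """Extract the 4 data bits (positions 3, 5, 6, 7) from a codeword."""
    return sum(((codeword >> (p - 1)) & 1) << i
               for i, p in enumerate((3, 5, 6, 7)))


def scrub(memory: list) -> int:
    """Walk the memory word by word, rewriting any word the ECC corrects."""
    corrected = 0
    for addr, word in enumerate(memory):
        fixed, was_bad = hamming74_correct(word)
        if was_bad:
            memory[addr] = fixed
            corrected += 1
    return corrected


data = [0x3, 0xA, 0xF, 0x0]
memory = [hamming74_encode(n) for n in data]

memory[1] ^= 1 << 5  # simulated SEU in word 1
memory[3] ^= 1 << 0  # simulated SEU in word 3

fixed_count = scrub(memory)
recovered = [hamming74_data(w) for w in memory]
```

After the scrub pass, both corrupted words have been rewritten and the recovered data matches what was originally stored.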
With regard to finite state machines (FSMs), we can design them in such a way that they can never enter an illegal state or perform an illegal transition. Actually, I’ll have to think about this … maybe the best we can do is design them in such a way that they immediately detect a problem and automatically recover from the error condition.
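As a minimal sketch of the “detect and recover” flavor of this, consider a three-state machine with one-hot state encoding. The states, the transition table, and the choice of IDLE as the recovery state are all hypothetical and chosen purely for illustration. The attraction of one-hot encoding is that any single bit flip leaves either zero or two bits set, so a single upset always lands on an illegal code that can be detected.

```python
from enum import IntEnum


class State(IntEnum):
    # One-hot encoding: exactly one bit is set in every legal state.
    IDLE = 0b001
    RUN = 0b010
    DONE = 0b100


LEGAL_CODES = {s.value for s in State}
NEXT_STATE = {State.IDLE: State.RUN, State.RUN: State.DONE, State.DONE: State.IDLE}


def step(state_bits: int) -> int:
    """Advance one clock; fall back to IDLE if the register holds an illegal code."""
    if state_bits not in LEGAL_CODES:
        return State.IDLE.value  # assumed safe recovery state
    return NEXT_STATE[State(state_bits)].value


state = State.RUN.value
state ^= 0b001           # simulated SEU: 0b011 is no longer one-hot
recovered = step(state)  # the machine notices and returns to IDLE
```

Whether dropping back to IDLE is actually safe depends entirely on the application; in a real design the recovery action deserves as much thought as the detection.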
In the case of individual registers, we can use triple modular redundancy (TMR), in which we replicate the register three times and then use voting circuitry to take a “best two out of three” approach. The way to think about this is that all three registers should ideally contain the same 0 or 1 value; if a radiation event leaves two registers containing one value and one register containing another, then the voting circuit will go with the majority.
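The voting logic itself is tiny. Here is a minimal sketch in software, assuming an 8-bit register triplicated into three copies (the register width and values are just for illustration). The majority function is the standard two-out-of-three expression applied bit by bit.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise best-two-out-of-three: each output bit follows the majority."""
    return (a & b) | (a & c) | (b & c)


# Three copies of a register that should all hold the same value.
copies = [0b10110010, 0b10110010, 0b10110010]

# Simulated SEU: flip bit 4 of the second copy.
copies[1] ^= 1 << 4

voted = majority_vote(*copies)

# Writing the voted value back into all three copies removes the upset.
copies = [voted] * 3
```

Note that the voter masks only one bad copy per bit position; if two copies are upset in the same bit before the value is refreshed, the majority will be wrong.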
Actually, you can perform TMR at any point in the hierarchy – at the individual register bit level – or at the functional block level – or at the chip level – or at the board level – or at the system level…
Assuming that you have created your design such that it will mitigate the effects of an SEU – like performing TMR on a register bit, for example – then all you have to do is to wait for the next clock to come along and clear out the offending value.
Having said this, there’s one thing that FPGAs have that ASICs don’t, and that’s their configuration cells. The problem is that if a configuration cell becomes corrupted by a radiation event, then it won’t automatically be cleared out on a subsequent beat of the system clock.
This is why antifuse-based FPGAs have long been of interest for deep space applications, because antifuse configuration cells aren’t susceptible to the effects of radiation. The downside is that these FPGAs are only one-time-programmable (OTP), so you had better get things right the first time.
Of course SRAM-based FPGAs have a lot of advantages, such as the fact that they are fabricated using a standard CMOS process. This means they can take advantage of the latest process nodes, which in turn means they can offer high densities. It also means that they offer all of the advantages of reprogrammability. The downside is that – until now – it has been possible for their SRAM-based configuration cells to be corrupted by radiation events.
Now Xilinx have actually been doing some really interesting things in this area, such as using TMR to replicate the entire design three times in the same FPGA and using voting circuits to compare the outputs from the three copies of the design. They also periodically read out the configuration associated with each design instantiation, perform a cyclic redundancy check (CRC) to test for any corruption, and – if necessary – use partial reconfiguration to reload the offending configuration bits.
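The readback-and-repair loop can be sketched in a few lines of software. To be clear, this is not the Xilinx mechanism: the idea of splitting the configuration into “frames,” the golden copy held elsewhere, and the function names are all assumptions made for illustration, with a plain list assignment standing in for partial reconfiguration. The CRC comes from Python’s standard zlib module.

```python
import zlib


def frame_crcs(frames: list) -> list:
    """Compute a reference CRC-32 for each (hypothetical) configuration frame."""
    return [zlib.crc32(frame) for frame in frames]


def scrub_configuration(live: list, golden: list, golden_crcs: list) -> list:
    """Read back each frame, compare CRCs, and reload any frame that differs."""
    repaired = []
    for index, frame in enumerate(live):
        if zlib.crc32(frame) != golden_crcs[index]:
            live[index] = golden[index]  # stand-in for partial reconfiguration
            repaired.append(index)
    return repaired


# Four made-up 8-byte frames serve as the known-good configuration.
golden = [bytes([i] * 8) for i in range(4)]
crcs = frame_crcs(golden)
live = list(golden)

# Simulated SEU: flip one bit in frame 2 of the live configuration.
corrupted = bytearray(live[2])
corrupted[3] ^= 0x10
live[2] = bytes(corrupted)

repaired = scrub_configuration(live, golden, crcs)
```

Only the corrupted frame is reloaded, which is the point of partial reconfiguration: the rest of the design keeps running while the damaged bits are replaced.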
To be honest I thought that this was the best we could hope for, until I heard about Xilinx’s new rad-hard Virtex-5QV FPGAs…
A prototype of the Xilinx space-grade Virtex-5QV FPGA, with Mega-rad capability, is part of the payload of MISSE-8 (Materials On International Space Station Experiment). Here, a NASA astronaut installs the MISSE-8 module on the International Space Station. MISSE is a testing ground for computing elements and materials to determine how they react to the effects of atomic oxygen, ultraviolet, direct sunlight, radiation and extreme temperatures. (Photo courtesy of NASA).
Configuration cells on steroids
And so we come to the Virtex-5QV. Like all Xilinx FPGAs intended for space applications, Virtex-5QVs are rad-hard in that they are fabricated with epitaxial layers to make them less susceptible to latchup conditions. But the real magic is in their configuration cells. First consider a standard 6-transistor SRAM-based configuration cell as illustrated below. As we see, this can be upset by direct ionization coming in on any trajectory.
Standard 6T configuration memory cell
Now compare this to a Virtex-5QV rad-hard by design (RHBD) configuration cell. This 12-transistor dual interlocking latch can only be “flipped” by the direct ionization of dual complementary nodes.
A Virtex-5QV RHBD 12T configuration memory cell
To be more specific, this is not simply a matter of using redundancy by creating a duplicated 6T cell. Every point in one half of the cell has a complementary point in the other half, and the same ionizing particle has to upset both complementary points for the cell’s value to become corrupted.
This means that only ionizing radiation that is coming in via a very narrow cone has any chance of affecting the cell (the size of the cone shown above is much exaggerated). The result is effective immunity to protons and more than 800X improvement for upset by heavy ions.
I could waffle on for hours about this. For example, Xilinx say that the Virtex-5QV can handle a Total Ionizing Dose (TID) of greater than 1 Mrad(Si). However, they told me that they’ve really verified these parts to twice this value; also that actual failures don’t start to occur until anywhere between 2 and 4 Mrad, which is very, VERY impressive.
There’s so much more… but my wife just called me to say that I need to be setting off for home because she’s just put our supper in the oven, and if there is one thing I know, that thing is that it would not be a good idea to be late for supper…
If you found this article to be of interest, visit Programmable Logic Designline, where you will find the latest and greatest design, technology, product, and news articles with regard to programmable logic devices of every flavor and size (FPGAs, CPLDs, CSSPs, PSoCs...).