I just returned from a trip to visit Xilinx in San Jose, California. While I was there, they showed me one of the most exciting things I've seen recently in programmable logic space (where no one can hear you scream).
The underlying idea seems simple enough – stick multiple FPGA die into a single package. It's the implementation that's so clever, and – as we shall discuss – the ramifications are truly enormous!
Xilinx 7 Series FPGAs
Before we plunge into the fray, let's take a step back to remind ourselves that – in June of this year – the folks at Xilinx announced their forthcoming FPGAs to be implemented at the 28 nm technology node. They're collectively calling these "Xilinx 7 Series FPGAs."
The folks at Xilinx are presenting these new devices as three ... hmmm ... I hesitate to say "three families" because that implies that they are functionally different ... I personally prefer to think of them as three branches of the same family, where these branches are called Artix-7, Kintex-7, and Virtex-7. All three of these family branches share a unified architecture that allows for ease of design, migration, and IP portability.
And where did these three names come from?
- Artix is rooted in "Arctic," suggesting cool and low-power. The Artix-7 branch of products will offer 50% lower power and 35% lower cost than the Spartan-6 family, making this branch ideally suited to the cost-sensitive, high-volume markets served by ASSPs and ASICs.
- Kintex gets its roots from the word "kinetic," for movement and energy. Kintex offers the best combination of price and performance. The new Kintex-7 branch will deliver the performance of the existing 40 nm Virtex-6 family at half the cost. It not only addresses aggressive power and cost requirements with significant price/performance improvements over Virtex-6 and Spartan-6, but will also deliver the insatiable bandwidth demanded by emerging applications such as next-generation broadcast systems and wireless networks.
- Virtex represents the summit and the highest capability. The Virtex-7 branch will provide up to 2M logic cells and deliver 2X the system performance of previous generations. Virtex-7 is designed to meet the extreme performance needs of wired infrastructure, high-performance computing (HPC) systems, and aerospace and defense, among others.

Stacked Silicon = "More than Moore"
What the folks at Xilinx didn’t say when they announced their 7 Series FPGAs was exactly how they intended to achieve the 2M logic cells in the Virtex-7 devices. I guess that (like most folks) I simply assumed that they were going to make bigger and bigger die containing more and more transistors. And, of course, they will be doing this, but there's much more to the story...
The problem is that when you first move to a new process technology there are issues with yield. Smaller die have higher yield, as illustrated in the graphic below, because each die presents a smaller target for random defects. This explains why FPGA vendors typically come out with their mid-range devices first; the larger FPGAs only become viable much later in the life-cycle, once the process has been fine-tuned.
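The relationship between die size and yield can be sketched with the classic Poisson yield model, Y = exp(-D·A). To be clear, this model and the defect-density number below are my own illustrative assumptions, not Xilinx figures, but they show why four modest die can be far cheaper to produce than one monolithic monster:

```python
import math

def die_yield(area_cm2: float, defect_density: float = 0.5) -> float:
    """Expected fraction of good die under the Poisson yield model.

    defect_density is in defects per cm^2 (a hypothetical value here,
    purely for illustration).
    """
    return math.exp(-defect_density * area_cm2)

# One quarter-size die vs one monolithic die of 4x the area:
small = die_yield(1.5)   # ~0.47 -- roughly half the small die are good
big = die_yield(6.0)     # ~0.05 -- almost every monolithic die is scrap
print(f"small die yield: {small:.2f}, large die yield: {big:.2f}")
```

Even allowing for the cost of the interposer and the need for four known-good die, the exponential penalty on area makes the case for "divide and conquer" rather compelling early in a process node's life.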
So how can this issue be addressed? Well, the folks at Xilinx have come up with something rather clever, which is to place multiple smaller die in the same package. Take a look at the graphic below. The four gold rectangles in the middle represent FPGA die. The large green square represents the main chip package. And we will return to consider the light blue square surrounding the four FPGA die in just a moment.
Now, having multiple chips in one package has been done many times before. Back in the 1990s we used to call them Multi-Chip Modules (MCMs). More recently, we started to use the term System-in-Package (SiP). We might think of the new Xilinx solution as SiP TNG (the next generation).
In conventional SiPs the die are attached directly to the package substrate. In this case, compared to the tracks on the die, the tracks on the package substrate are relatively large, slow, and driving signals onto them consumes a lot of power. What Xilinx are doing is to use a special layer of silicon known as a "silicon interposer" combined with Through-Silicon Vias (TSVs) as illustrated below:
This technology may be referred to as "Stacked Silicon Interconnect" by some and "2.5D integrated circuits" by others. Depending on who is doing what to whom, the silicon interposer may be purely passive (that is, contain only tracks) or it may be active (it may also include devices like transistors and logic gates ... all the way up to complex macros and cores).
In this first Xilinx incarnation, the four FPGA die are implemented at the 28 nm technology node, while the passive silicon interposer is implemented at the 65 nm technology node. Implementing the large silicon interposer at this older, more mature node reduces costs and increases yield without significantly degrading performance.
One way to think about this is that the silicon interposer essentially adds four additional tracking layers that can be used to connect the FPGAs to each other. And how many connections are we talking about here? Well, I bet you'll be surprised when I tell you that there are more than 10,000 connections between each pair of adjacent die!
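The aggregate numbers are worth a quick back-of-envelope calculation. The "more than 10,000 connections per adjacent pair" figure comes straight from the article; the assumption that the four die sit in a linear row (giving three adjacent pairs) is mine:

```python
def total_connections(num_die: int, per_pair: int = 10_000) -> int:
    """Total die-to-die micro-connections, assuming the die form a
    linear row so that (num_die - 1) adjacent pairs exist.

    per_pair is the >10,000 connections-per-pair figure quoted for the
    Xilinx interposer; treat the result as a lower bound."""
    adjacent_pairs = num_die - 1
    return adjacent_pairs * per_pair

print(total_connections(4))  # 30000 -- three pairs of adjacent die
```

Thirty-thousand-plus inter-die signals is a number no package-substrate or board-level interconnect could hope to approach, which is really the whole point of the interposer.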
On top of this, Through-Silicon Vias (TSVs) are used to pass signals through the silicon interposer to C4 bumps on the bottom of the interposer. These bumps are then used to connect the interposer to the package substrate.
Compared with having to use standard I/O connections to integrate two FPGAs together on a circuit board, this stacked silicon interconnect technology provides over 100X the die-to-die connectivity bandwidth-per-watt, at one-fifth the latency, without consuming any of the FPGAs' high-speed serial or parallel I/O resources.
Furthermore, by having the die sit adjacent to each other, Xilinx can avoid the thermal flux and design tool flow issues that would be introduced had a purely vertical die-stacking approach been adopted.
With regard to the design tools, the tracks on the silicon interposer are – to a large extent – seen as simply being long lines. The folks at Xilinx say that designers can simply "Press the Big Red Button" for the entire design to be automatically implemented across all four FPGA die as though they were a single large die. Alternatively, if the users wish to partition the design across the four die by hand, they can obtain 8 to 10% performance improvement on top of the staggering performance that is already offered by this technology.
Proven technology and supply chain
Of course, it's easy for folks to jump up and down and wave their arms around and tell you all sorts of things that sound wonderful, but when you come to look closely they aren't really there. There's a world of difference between talking about this stuff and actually doing it. But I think it's safe to say that Xilinx have actually succeeded here, not least because they've been working on it for a long, long time.
Personally I am amazed that they've managed to keep this secret. It seems that they've actually been working on this for the last four or five years. They created their first test vehicle at the 90 nm node in 2008; the second test vehicle at the 40 nm node in 2009; and the third test vehicle at the 28 nm node this year in 2010 (check out the picture of this latter test vehicle below – I actually held this little beauty in my sweaty hands):
Now, the silicon graveyard is littered with technologies that seemed to be a good idea at the time, but which never succeeded because their originators failed to ensure that all of the players were in place. Not this time – as illustrated in the graphic below, Xilinx have fully solved the big infrastructure supply chain problem, which has proved to be the show-stopper for other folks in the past.
The current state of play is that the Xilinx design tools are already geared up to take full advantage of this new technology starting with the ISE 13.1 Beta release, and we should be seeing the first engineering samples of this technology around the middle of 2011. I cannot wait!
As I mentioned at the beginning, the ramifications of this new technology are truly enormous! Here are just a few thoughts off the top of my head...
First, as I noted above, FPGA vendors typically come out with their mid-range devices first, because the larger FPGAs only become viable much later in the life-cycle once the process has been fine-tuned. Well, by gathering four medium capacity die (and remember that the term "medium" is relative – these are actually honking big die whichever way you look at them) into a single package as described here, the effect is as though we had immediate access to the largest members of the family.
Another way to look at this is that we are getting next-generation density in this generation's technology. And, of course, as the process becomes fine-tuned and the yield improves, the folks at Xilinx can boost the capacity even further.
Another consideration is that we don't have to limit ourselves to four FPGA die in a package. There could be fewer (two or three) or more (six, eight...). Also, the FPGA die don't have to be homogeneous. Although my understanding is that all four die will be identical in the initial releases, there's no reason why Xilinx couldn't decide to "mix-and-match" in the future – for example, combining two DSP-intensive die with two SERDES-intensive die. Or how about replacing one or two of the die with pure memory die, or... I tell you, the more you think about this, the more exciting it becomes.
And one last point to ponder is that Xilinx currently have a push to use SRAM-based FPGAs in space applications. (Creating radiation-tolerant SRAM-based FPGA designs is something of a "hot-button" for me at the moment.) Well, how about using three of the die to implement triple-modular redundancy (TMR) at the die level, and then using the fourth die to perform housekeeping tasks like implementing the voting circuits and constantly reading the configuration data for the other die, performing CRC checks on that data, and reloading as required? (The functions on this fourth die could themselves be implemented in TMR fashion, and this die could also monitor and reload its own configuration data as necessary.)
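The heart of any TMR scheme is the majority voter, which in hardware is just three ANDs and an OR per bit. Here is a minimal software sketch of that bitwise vote (the function name and the upset scenario are mine, purely for illustration):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant values: each output bit is 1
    if at least two of the three corresponding input bits are 1."""
    return (a & b) | (a & c) | (b & c)

# A single-event upset flips one bit in one of the three copies;
# the majority vote masks the error completely.
golden = 0b1011_0010
upset = golden ^ 0b0000_0100  # bit 2 flipped by radiation
assert tmr_vote(golden, golden, upset) == golden
```

Note that the vote only masks an upset; it's the housekeeping die's scrubbing (CRC-check and reload of the configuration data) that actually repairs the damaged copy before a second upset can accumulate.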
New ideas are popping into my head as I pen these words. I tell you, we certainly do live in interesting times...