For the last couple of years, QuickSilver Technology has been tempting us with promises of delectation and delight with regard to their soon-to-be-released Adaptive Computing Machine (ACM), which they herald as being a revolutionary new form of digital IC.
According to QuickSilver, the ACM's architecture is specifically designed to allow it to be adapted "on-the-fly" tens, or hundreds of thousands of times a second, while consuming very little power. That is, any part of the device - from a few small portions all the way up to the full chip - can be adapted blazingly fast, in many cases within a single clock cycle. This architecture allows dynamic software algorithms to be directly converted into dynamic hardware resources, which supposedly results in the most efficient use of hardware in terms of cost, size, performance, and power consumption. Due to the way the ACM performs its magic, it can dramatically reduce - or eliminate - the need for custom ASIC material in a typical high-performance system on chip (SoC).
Since the folks at QuickSilver commenced their quest, my interest has been two-fold. First of all, the technology they described in hushed tones seemed to be incredibly cool, even though - until now - they've provided only the vaguest of hints and the occasional crumb of information as to how they intended to achieve their objectives. Secondly, I've been pondering how one would go about creating EDA tools geared to designing devices whose internal functionality can be "transmogrified" on a clock-by-clock basis.
As we all know, there are many ways to categorize ICs. For example, we might say that ASICs are "fine-grained," because designer's can specify their functionality down to the level of gates (and switches) and wires. Correspondingly, the majority of today's FPGAs may be classed as "medium-grained," because they consist of islands of small blocks of programmable logic and memory "floating around" in a sea of programmable interconnect. (Yes, I know that today's FPGA offerings can also include processor cores, large blocks of memory, and numbers of functions like adders and multipliers, but the underlying architecture is as described above.)
And so we come to the ACM, which features an incredibly cunning "coarse-grained" architecture. Before we consider this architecture, however, it's worth spending a few moments discussing the algorithmic evaluation QuickSilver performed in order to determine the underlying requirements.
When QuickSilver first started to consider Adaptive Computing (AC) in the late 1990s, they realized that the conventional Reconfigurable Computing (RC) doctrine was considering the problem at either too macro of a level (at the level of entire applications or algorithms, resulting in inefficient use of resources) or at too micro of a level (at the level of individual gates or FPGA blocks, resulting in hideously difficult application programming). In reality, it is only necessary to consider the problem at the level of algorithmic elements.
Based on this viewpoint, they asked themselves "just how many core algorithmic elements are there?" They started by considering the elements used in word-oriented algorithms. In this case, they commenced with the compute-intensive Time Division Multiple Access (TDMA) algorithm used in digital wireless transmission. Any variants such as Sirius, XM Radio, EDGE, and so forth form a subset of this algorithmic class, so an AC architecture that could handle high-end TDMA should also be able to handle its less sophisticated cousins.
Figure 1 - Algorithm space
Having evaluated word-oriented algorithms, they took a 180 degree turn to consider their bit-orientated counterparts, such as Wideband Code Division Multiple Access (W-CDMA) - used for wideband digital radio communications of Internet, multimedia, video, and other capacity-demanding applications - and its sub-variants such as CDMA2000, IS-95A, etc. And they also looked at algorithms comprising various mixes of word-oriented and bit-oriented components, such as MPEG, voice and music compression, and so forth.
The results of these evaluations revealed that algorithms are heterogeneous in nature, which means that if you take a bunch of diverse algorithms, their constituent elements are wildly different. In turn, this indicates that the homogeneous architectures associated with traditional FPGAs (having the same lookup table replicated tens of thousands of times) is not appropriate for many algorithmic tasks. The solution is to use a heterogeneous architecture that fully addresses the heterogeneous nature of the algorithms it is required to implement.
A cool beans architecture
Based on the above evaluations, the backroom guys and gals with size-16 brains (they don't let them out very often) came up with a heterogeneous node-based architecture:
Figure 2 - A node-based architecture
They started with five types of nodes: arithmetic, bit-manipulation, finite state machine, scalar, and configurable input/output used to connect to the outside world (the later is not shown in the above figure).
Each node consists of a number of computational units - which can themselves be adapted on the fly, but that's another story - and its own local memory (approximately 75% of a node is in the form of memory). Additionally, each node includes some configuration memory. Unlike FPGAs with their serial configuration bit-stream, an ACM has a dedicated 128-bit or 256-bit bus to carry the data used to adapt the device.
It's important to understand that each node performs tasks at the level of complete algorithmic elements. For example, a single arithmetic node can be used to implement different (variable width) linear arithmetic functions such as a FIR filter, a Discrete Cosign Transform (DCT), a Fast Fourier Transform (FFT), etc. Such a node can also be used to implement (variable width) non-linear arithmetic functions such as ((1/sine A) x (1/x)) to the 13th power.
Similarly, a bit-manipulation node can be used to implement different (variable-width) bit-manipulation functions, such as a Linear Feedback Shift Register (LRSR), Walsh code generator, GOLD code generator, TCP/IP packet discriminator, etc.
A finite state machine node can, not surprisingly, be used to implement any class of FSM. Last but not least, a scalar node can be used to execute legacy code, while a configurable input/output node (not shown in the figure) can be used to implement I/O in the form of a UART or bus interfaces such as PCI, USB, or Firewire.
Once again, a key point is that any part of the device - from a few small portions all the way up to the full chip - can be adapted blazingly fast, in many cases within a single clock cycle. This allows for a radical change in the way in which algorithms are implemented. As opposed to passing data from function to function, the data can remain resident in a node while the function of the node changes on a clock-by-clock basis.
It also means that, unlike an ASIC implementation in which algorithms are effectively "frozen in silicon," the ACM's ability to be adapted tens or hundreds of thousands of times a second means that only those portions of an algorithm that are actually being executed need to be resident in the device at any one time. This provides for tremendous reductions in silicon area and power consumption.
Another key point is that the ACM's architecture is incredibly scalable. A level 1 node cluster is formed from an arithmetic, bit-manipulation, FSM, and scalar node connected via a Matrix Interconnection Network (MIN). A level 2 node cluster is formed from four level 1 clusters linked by their own MIN, while a level 3 cluster is formed from four level 2 clusters linked by their own MIN, and so forth. Different ACMs can contain from one to thousands of node clusters, as required.
The proof of the pudding
As engineers, we're all used to seeing wild and wonderful claims for emerging technologies (I could tell you some stories that would make your hair stand on end!), but QuickSilver's ACM technology seems to be real. For example, they compared their first physical test chip to best-in-class ASIC implementations for a number of well-understood problems as follows (note that, thus far, these tasks have always been implemented using ASICs, because DSP and FPGA implementations are way too slow and power hungry).
In the case of a CDMA2000 searcher with 2x sampling using 512-chip complex correlations, with captured data processed at 8x chip rate (equivalent to 16 parallel correlators running in real time), the best-in-class ASIC took 3.4 seconds, while the ACM took only 1.0 seconds (3.4 x faster). In the case of a CDMA2000 pilot search with the same parameters as above, the ASIC took 184 ms, while the ACM took only 55ms (3.3 x faster). And in the case of a W-CDMA searcher with 1x sampling using 256-chip correlations with streaming data, an ASIC took 533 us while the ACM took only 232 us (2.3 x faster). Furthermore, in addition to out-performing the three ASIC implementations, the ACM counterpart (one ACM was used to perform all three tasks) used significantly less silicon real estate and consumed only a fraction of the power!
Other things to ponder
All of the above opens up an entire raft of questions. For example, is anyone else doing anything like this? Well it's funny you should ask, because there's a company called picoChip in England that appears to be working on a somewhat related coarse-grained concept. This seems to be based on arrays of processors, although not many nitty-gritty details have leaked out as to their underlying implementation as yet.
There's also a German company called PACT which has a finer-grained architecture based on reconfigurable "clusters of arrays" (wow!) of 32-bit arithmetic processing elements and embedded memory. Meanwhile, in Japan, a company called IP Flex is jumping up and down singing the praises of their forthcoming architecture. This appears to be another finer-grained approach based on a combination of RISC plus a reconfigurable fabric, but it's sort of hard to tell from the information on their web site. At this stage, it's not totally clear to me exactly what applications PACT and IP Flex are targeting, but dynamic reconfiguration certainly plays a role.
I look forward to discovering the true capabilities (and appropriate application spaces) for these finer-grained architectures as more information becomes available. However, my gut feeling is that the coarse-grained heterogeneous architecture sported by QuickSilver's ACM could prove to be extremely interesting (I currently don't know enough about picoChip's architecture to comment, except to say that it appears to be coarse-grained homogeneous as opposed to coarse-grained heterogeneous).
Of course, the next big question is how would one go about creating applications for one of these little rapscallions? Well, QuickSilver's design flow approach is based on a C derivative language called SilverC, while my limited understanding of the picoChip flow is that it's similar to conventional HDL-based flows.
In my next column I'll introduce QuickSilver's SilverC language and ponder how this fits into EDA space as we know it. Meanwhile, I also hope to be able to chat with the picoChip folks to see exactly what they're up to, and mayhap pen a future column on them. As always, until next time, have a good one!
Clive (Max) Maxfield is president of Techbites Interactive, a marketing consultancy firm specializing in high-tech. Author of Bebop to the Boolean Boogie (An Unconventional Guide to Electronics) and co-author of EDA: Where Electronics Begins, Max was once referred to as a "semiconductor design expert" by someone famous who wasn't prompted, coerced, or remunerated in any way.