Large arrays of "things"
One way to think of the hardware used to perform computations is in terms of its granularity. The finest level of granularity is provided by an ASIC or ASSP, in which algorithms can be hand-crafted in silicon at the level of individual logic gates. Next, we have FPGAs with their four-input lookup tables (LUTs), the SRAM-based versions of which have the advantage that they can be reconfigured as required. [Structured ASICs may be considered to occupy a space somewhere between ASICs and FPGAs, especially in the case of eASIC (www.easic.com) devices, which combine custom routing with FPGA-like SRAM-based LUTs.]
Note that we might decide to include one or more hard processor cores on an ASIC (in which case it becomes an SoC); similarly, we might include one or more hard and/or soft processor cores on an FPGA (which some folks would also regard as an SoC). All of these cases would then be considered hybrid solutions involving a mixture of traditional processor core(s) and algorithms implemented in gates, LUTs, and so forth.
In recent years, a number of companies have started to offer more exotic architectures, each of which is applicable to a focused set of computational applications. If we consider these offerings in terms of granularity, then the first step above traditional FPGAs would be an architecture such as that provided by Elixent (www.elixent.com). This reconfigurable algorithm processing (RAP) architecture – which is targeted toward the efficient implementation of arithmetic/DSP functions – is based on an array of 4-bit arithmetic-logic units (ALUs) in a "sea" of programmable interconnect. These ALUs can be linked using fast carry chains so as to implement wider functions. In addition to forming part of a datapath, the output of one ALU may be used to select the instruction of another ALU. The programming model for these devices is to take the same RTL used to create an ASIC or FPGA, and to use an appropriate synthesis engine to generate a corresponding configuration file.
Next, we have the field programmable object array (FPOA) architecture from MathStar (www.mathstar.com). An example FPOA device may contain around 400 silicon "objects" in the form of 16-bit ALUs (each with its own instruction cache and scratchpad memory), register files, and multiply-accumulators (MACs) – along with internal RAM banks and external high-speed memory interfaces – all of which can communicate with each other through a programmable interconnect fabric. Each object can be programmed individually and acts autonomously. All of the objects and the interconnect run at 1 GHz. In addition to general-purpose I/O (GPIO) pins, the FPOA boasts high-speed I/O that can transmit and receive 2 × 32 GB/s. The main programming model for these devices is to use a graphical interface that generates SystemC; the target application area is compute-intensive DSP tasks such as edge detection and pattern recognition for high-frame-rate, high-resolution robotic vision systems.
A good example of the next higher level of granularity is provided by picoChip (www.picochip.com), whose picoArray features several hundred 16-bit CPU and DSP cores connected by a sea of programmable interconnect that can move 5 terabits of data per second around the device. Each core has its own local memory (ranging from 1K to 64K depending on the core type). The programming model for a picoArray is an interesting mixture of styles. A VHDL block-level netlist is used to define the connectivity between the CPU and DSP cores (each block in the netlist maps onto a specific type of core); meanwhile, the actual function of each block is defined in C and/or assembly code.
Another good example of this level of granularity is provided by the multiprocessor DSP (MDSP) architecture from Cradle Technologies (www.cradle.com). Current incarnations of the MDSP offer up to 8 CPU cores and 16 DSP cores. Each of these 32-bit cores has its own local instruction and data memory. The latest programming model for these devices is to create a C program that is divided into multiple threads, and to tag each thread as being either a control thread (to be executed on a CPU) or a signal processing thread (to be executed on a DSP). A run-time dynamic scheduler is then used to assign threads to available resources on the device.
Configurable and reconfigurable cores
Perhaps the best-known configurable core to date is that fielded by Tensilica (www.tensilica.com). In this case, you start with a core 32-bit post-RISC processing engine called Xtensa that comprises around 25K gates. Next, Tensilica's tools analyze your C/C++ application and evaluate millions of possible processor extensions based on techniques like single-instruction-multiple-data (SIMD) and vector operations, operator fusion, and parallel execution. Once you select the configuration that's best for your particular application, a processor generator outputs the RTL for your custom processor along with a custom compiler, assembler, and source-level debugger. A typical customer may end up with 5 or 6 heterogeneous Tensilica cores on their SoC, and some devices (for networking applications) have several hundred such cores.
The term "reconfigurable computing" means different things to different people. One incarnation of this is static reconfiguration, in which a programmable device such as an FPGA is first configured to perform a certain task, and is later reconfigured to perform a different task. By comparison, dynamic reconfiguration refers to configuring different portions of a device "on-the-fly" while other portions of the device continue to perform their tasks.
One interesting scenario involves an FPGA containing a number of soft microprocessor and DSP cores, each executing its own local microcode. A special controller block can be used to supply the various processor cores with new microcode as required (this new microcode could be stored in an external memory).
Perhaps the best example of reconfigurable computing to date comes from Stretch Inc. (www.stretchinc.com), which provides a family of off-the-shelf software-configurable processors. Each of these chips contains two main units: Tensilica's Xtensa core coupled with Stretch's reconfigurable instruction set extension fabric (ISEF), which contains wide register files and lots of computational units (multipliers, adders, and so forth) in a sea of programmable interconnect. Stretch's tools analyze your C/C++ application and generate a corresponding configuration file to program the ISEF to perform specific tasks. The point here is that the ISEF can be reconfigured thousands of times a second so as to tailor it to better serve different portions of the algorithm.
This article has really only scratched the surface of the state of play in modern computing. In addition to yet more hardware solutions, it is also necessary to consider such things as operating system issues along with the problems of programming, debugging, and verifying applications.
The point is that there are now a lot of options available to the designers of today's state-of-the-art systems. As usual, system architects have to perform the traditional tradeoff between power, performance, and cost. Ultimately we have to ask the questions: How much performance do you want? How much do you need? And how much can you afford?
Clive "Max" Maxfield is president of TechBites Interactive, a marketing consultancy firm specializing in high technology. Max is the author and co-author of a number of books, including Bebop to the Boolean Boogie (An Unconventional Guide to Electronics) and How Computers Do Math (ISBN: 0471732788) featuring the pedagogical and phantasmagorical virtual DIY Calculator.
Widely regarded as being an expert in all aspects of computing and electronics (at least by his mother), Max was once referred to as "an industry notable" and a "semiconductor design expert" by someone famous who wasn't prompted, coerced, or remunerated in any way. Max can be reached at firstname.lastname@example.org.