Design Article
The state-of-play in multi-processor and reconfigurable computing
Clive Maxfield
2/21/2006 4:01 PM EST
For the majority of the three and a half decades since the introduction of Intel's 4004 microprocessor in 1971, increases in computational performance and throughput have largely been achieved through relatively obvious techniques:
- Increasing the width of the data bus from 4 to 8 to 16 to 32 to the current 64 bits used in high-end processors.
- Adding (and then increasing the size of) local high-speed cache memory.
- Shrinking the size – and increasing the number – of transistors; today's high-end processors can contain hundreds of millions of transistors.
- Increasing the sophistication of processor architectures, including pipelining and adding specialized execution blocks, such as dedicated floating-point units.
- Increasing the sophistication of such things as branch prediction and speculative execution.
- Increasing the frequency of the system clock; today's high-end processors have core clock frequencies of 3 GHz and higher.
The problem is that these approaches can only go so far, with the result that traditional techniques for increasing computational performance and throughput are starting to run out of steam. In this article, we take a "50,000 foot" view of the hardware portion of the computing universe and introduce a wide variety of existing and emerging solutions, including the use of multiple processors and the concept of configurable (and reconfigurable) processors.
The computing universe
For the purposes of this article, we will consider the term "computing" in its most general sense; that is, computing means the act of performing computations. There are many different types of computational tasks we might wish to perform, including – but not limited to – general-purpose office-automation applications (word-processing, spreadsheet manipulation, etc.); extremely large database manipulations such as performing a Google search; one-dimensional digital-signal processing (DSP) applications such as an audio codec; and two-dimensional DSP applications such as edge-detection in robotic vision systems.
In many cases, these different computational tasks are best addressed by a specific processing solution. For example, an FPGA may perform certain DSP tasks very efficiently, but one typically wouldn't consider using one of these devices as the main processing element in a desktop computer. Similarly, desktop-class processors from Intel and AMD are applicable to a wide variety of computing applications, but you wouldn't expect to find one powering a cell phone (apart from anything else, the battery life of the phone would be measured in seconds).
Fundamentally, there are three main approaches when it comes to performing computations (Fig 1). At one end of the spectrum we have a single, humongously large processor; at the other end of the spectrum we have a massively-parallel conglomeration of extremely fine-grained functions (which some may call "a great big pile of logic gates"); and in the middle we have a gray area involving multiple medium- and coarse-grained processing elements. (Note that this article focuses on the microprocessor and DSP arenas; mainframe computers and supercomputers are outside the scope of these discussions.)

Fig 1. The computing universe.
Single cores
The classical processing solution for many applications is to use a single, humongously large "off-the-shelf" processor, such as a general-purpose CPU chip from Intel (www.intel.com) or AMD (www.amd.com) or a special-purpose DSP chip from Texas Instruments (www.ti.com). Similarly, in the case of embedded applications, one might choose to use a single general-purpose processor core from ARM or ARC or a DSP core from TI.
At some stage, a single processor simply cannot meet the needs of a target application, in which case it becomes necessary to evaluate the alternative solutions described in the following sections.
Co-processors and accelerators
One technique is to augment an existing processor with one or more dedicated co-processors and/or hardware accelerators. For example, Critical Blue (www.criticalblue.com) has a tool called Cascade that accepts as input compiled applications in the form of executable ARM machine code. By means of a simple interface, the user selects which functions are to be accelerated, and Cascade then generates the RTL for a dedicated co-processor (and the microcode to run on that co-processor) to implement the selected functions.
An alternative approach is that taken by Poseidon Systems (www.poseidon-systems.com), whose Triton tool suite allows users to analyze C source code, to identify areas of the code to be accelerated, and to generate co-processors or hardware accelerators for use with ARM, PowerPC, Nios, or MicroBlaze hard and soft processor cores implemented in SoCs and/or FPGAs.
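As a rough way to reason about which functions are worth accelerating, the following minimal sketch applies Amdahl's law, a standard result that is not part of the tools described above. The function name and the 60% / 10x figures are hypothetical and chosen only for illustration; real numbers would come from profiling the target application.

```python
def overall_speedup(accelerated_fraction: float, accelerator_gain: float) -> float:
    """Amdahl's law: whole-program speedup when only part of the run time is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_gain <= 0.0:
        raise ValueError("accelerator_gain must be positive")
    # Time left on the main processor plus the (shortened) time on the co-processor.
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / accelerator_gain)


# Hypothetical profile: the selected functions account for 60% of run time
# and the co-processor executes them 10 times faster.
print(round(overall_speedup(0.60, 10.0), 2))  # prints 2.17
```

Under these assumed numbers the application as a whole runs only a little over twice as fast, because the 40% of the run time left on the main processor quickly dominates. This is why the choice of functions to offload matters more than the raw speed of the accelerator.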
Multi-cores (homogeneous)
Perhaps the most famous early example of using multiple processors was the INMOS transputer chip, which surfaced in the mid-1980s (the all-lowercase “transputer” was the official written form). As a point of interest, the native programming language for the transputer was occam (again, the all-lowercase “occam” was the official written form), which was named in honor of the 14th-century English philosopher and Franciscan friar William of Ockham, also spelled Occam (1286–1348 give or take a few years).
The idea was that users could hook as many transputer chips together on a circuit board as was necessary to satisfy the computational requirements of the target application. Many believed that the transputer was going to be the next great leap in computing, but creating programs that ran efficiently on this parallel architecture was non-trivial, and the transputer eventually faded away.
Although most non-engineers don't realize it, it is actually very common for systems to use multiple processors. Consider a home computer, for example; in addition to the main CPU, the keyboard will have its own processor, each hard disk and optical (CD/DVD) drive will typically contain two or more processors, and so forth. However, these examples are characterized by the fact that the multiple processors all have very focused, well-partitioned tasks that can be largely performed in isolation. It is much more complicated to have tightly-coupled homogeneous processors, such as the dual-core chips that are now available from AMD and Intel (the term "homogeneous" means that these processing elements are of the same kind). Another term that is applicable to this type of configuration is symmetric multiprocessing (SMP), which means that the view of the rest of the system – memory, input/output, operating system, etc. – is exactly the same (i.e. "symmetrical") for each processor.
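The symmetric arrangement can be illustrated in software with a minimal sketch in which identical workers share one task queue and one result store, so that any worker may pick up any task. The function name, the squaring "work," and the task names are hypothetical. Python threads are used here only to show the structure; in standard CPython builds the global interpreter lock keeps CPU-bound threads from running simultaneously, so this is not a demonstration of real multi-core speedup.

```python
import queue
import threading


def run_symmetric(tasks, worker_count=2):
    """Run tasks on identical workers that share one queue and one result store."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = {}                     # shared memory, visible to every worker
    results_lock = threading.Lock()  # guards the shared dictionary

    def worker():
        # Every worker runs the same code and may pick up any task.
        while True:
            try:
                name, value = pending.get_nowait()
            except queue.Empty:
                return               # queue drained, worker exits
            squared = value * value  # stand-in for real work
            with results_lock:
                results[name] = squared

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_symmetric([("a", 2), ("b", 3), ("c", 4)]))
```

Which worker handles which task is decided at run time, which is exactly what makes the arrangement flexible and also what makes the shared state (here, the lock around the result dictionary) something the programmer has to manage.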
When moving from a single core to a dual-core configuration, the system becomes noticeably more responsive, and users don't experience those annoying "hang-ups" and "stalls" that are the hallmark of a single-processor environment. And two cores are only the start; for example, Intel is already talking about a four-core processor called "Clovertown," which is expected to appear on the market in early 2007.
Meanwhile, Sun Microsystems (www.sun.com) is already fielding an eight-core processor called the UltraSPARC T1. Formerly known by the code name Niagara, this extreme-performance device is well-suited to highly-threaded commercial environments, such as thread-aware web servers, application servers, and database servers. Of particular interest is the fact that Sun is open-sourcing this chip; in fact, the RTL recently became available when the www.opensparc.net website went live on January 24th 2006.
Before we move on, we should also make mention of the Multicore Association (www.multicore-association.org), which is a new industry group focused on companies involved with multicore processor, software, and system implementations.
Multi-cores (heterogeneous)
As opposed to using multiple identical cores, it may be preferable to use a mixture of dissimilar cores. For example, even the most rudimentary cell phone will typically contain at least one ARM core to manage the human-machine interface coupled with at least one DSP core to perform the baseband signal processing. Such solutions are referred to as being "heterogeneous," meaning "consisting of dissimilar elements or parts."
One example of this type of scenario is the Cell processor, developed by IBM (www.ibm.com) together with Sony and Toshiba, which consists of a general-purpose CPU core tightly coupled with eight specialized DSP-like cores. Another example is a high-end cell phone, which may include two or more CPU cores and two or more DSP cores combined with large numbers of hardware accelerator blocks and peripheral functions.
Things are further complicated by the fact that the processing cores and other functional units may have their own individual memories along with shared memory structures, and everything may be connected using multi-level buses and cross-point switches. One term that is commonly associated with this type of environment is asymmetric multiprocessing (AMP or ASMP), in which computational tasks (or threads) are strictly divided by type between processors.
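The strict division of tasks by type can be sketched in software as one dedicated worker per task type, each with its own private inbox. This is only an illustration of the idea, not a model of any particular chip: the task types "control" and "signal," their stand-in handlers, and the function name are all hypothetical.

```python
import queue
import threading


def run_asymmetric(tasks):
    """Route each task to the one worker dedicated to its type."""
    # Hypothetical task types and handlers, chosen only for illustration.
    handlers = {
        "control": lambda payload: payload.upper(),             # stand-in for UI/control code
        "signal": lambda payload: sum(payload) / len(payload),  # stand-in for DSP code
    }
    inboxes = {kind: queue.Queue() for kind in handlers}  # one private inbox per worker
    results = []
    results_lock = threading.Lock()

    # Strict division by type: a task can only enter its own type's inbox.
    for kind, payload in tasks:
        inboxes[kind].put(payload)
    for kind in handlers:
        inboxes[kind].put(None)  # sentinel: no more work for this worker

    def worker(kind):
        # Each worker only ever sees tasks of its own type.
        while True:
            item = inboxes[kind].get()
            if item is None:
                return
            with results_lock:
                results.append((kind, handlers[kind](item)))

    threads = [threading.Thread(target=worker, args=(kind,)) for kind in handlers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_asymmetric([("control", "menu"), ("signal", [1, 2, 3]), ("control", "dial")]))
```

Unlike a symmetric arrangement, no worker can help another with its backlog, so the partitioning of work between the dissimilar processing elements has to be decided up front.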



