Editor's Note: Way back in the mists of time (in the early days of 2006), I penned an article for EE Times discussing the various computing options available to designers, from single processors and multiple processors, via co-processors and hardware accelerators, through arrays of "things", all the way down to "great big piles of gates." Since then, I've been maintaining this little rapscallion as a "living breathing document" on the "More Cool Stuff" page of my DIY Calculator website. The point is that I've added so much new "stuff" that it seemed like a good idea to re-present the little scamp here on Programmable Logic DesignLine. Thus, for your delectation and delight, an abstracted version of this ever-evolving paper is presented below.
Before we start, we should note that the following discussions relate to the illustration shown below (of which I am inordinately proud, because capturing the diverse computing options graphically proved to be a non-trivial task).
The computing universe
Defining Some Terms
OK, let's kick things off by defining a few concepts, because this will make things easier as we wend our way through the rest of this paper. The term central processing unit (CPU) refers to the "brain" of a general-purpose digital computer – this is where all of the decision making and number crunching operations are performed. By comparison, a digital signal processor (DSP) is a special-purpose CPU that has been created to process certain forms of digital data more efficiently than can be achieved with a general-purpose CPU.
Both CPUs and DSPs may be referred to as "processors" for short. The term microprocessor refers to a processor that is implemented on a single integrated circuit (often called a "silicon chip," or "chip") or a small number of chips. The term microcontroller refers to the combination of a general-purpose processor along with all of the memory, peripherals, and input/output (I/O) interfaces required to control a target electronic system (all of these functions are implemented on the same chip to cut down on size, cost, and power consumption).
The heart of a processor is its arithmetic-logic unit (ALU) – this is where arithmetic and logical operations are actually performed on the data. Also, in the case of DSP algorithms, it is often required to perform multiply-accumulate (MAC) operations in which two values are multiplied together and the result is added to an accumulator (that is, a register in which intermediate results are stored). Thus, DSP chips often contain special hardware MAC units.
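To make the multiply-accumulate idea concrete, here is a minimal sketch of a finite impulse response (FIR) filter, a typical DSP algorithm built almost entirely from MAC operations. The function name `fir_filter` and the sample values are purely illustrative assumptions; a real DSP chip would perform each MAC step in dedicated hardware rather than in a software loop.

```python
def fir_filter(samples, coeffs):
    """Filter `samples` with `coeffs` using explicit multiply-accumulate steps.

    Illustrative only: the name and structure are assumptions made for
    this sketch, not taken from any particular DSP library.
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0  # the accumulator: a "register" holding the intermediate result
        for k, coeff in enumerate(coeffs):
            if n - k >= 0:  # ignore samples from before the start of the stream
                acc += coeff * samples[n - k]  # one MAC: multiply, then accumulate
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # A 3-tap filter with all coefficients equal to 1 (a running sum).
    print(fir_filter([1, 2, 3, 4], [1, 1, 1]))  # prints [1, 3, 6, 9]
```

Each output sample costs one MAC per filter coefficient, which is why a hardware unit that completes a MAC in a single step can make such a difference to this class of algorithm.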
Last but not least, the term core is understood to refer to a microprocessor (CPU or DSP) or microcontroller that is implemented as a function on a larger device such as a field-programmable gate array (FPGA) or a System-on-Chip (SoC). Depending on the context, the term processor may be used to refer to a chip or a core. [The underlying concepts behind devices such as FPGAs and SoCs – and also ASICs and ASSPs as mentioned later in this paper – are explained in excruciatingly interesting detail in my book Bebop to the Boolean Boogie (An Unconventional Guide to Computers), ISBN: 0750675438.]
The first commercial microprocessor was the Intel 4004, which was introduced in 1971. This device had a 4-bit CPU with a 4-bit data bus and a 12-bit address bus (the data and address buses were multiplexed through the same set of four pins because the package was pin-limited). Comprising just 2,300 transistors and with a system clock of only 108 kHz, the 4004 could execute a mere 60,000 operations per second.
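As a rough illustration of what multiplexing a 12-bit address through four pins involves, the following sketch splits an address into three 4-bit "nibbles" that could be sent one after another, and then reassembles them at the far end. The function names and the low-nibble-first ordering are assumptions made for this example; it is not intended as a cycle-accurate model of the 4004's bus.

```python
def split_address(addr):
    """Split a 12-bit address into three 4-bit nibbles, low nibble first.

    The ordering is an assumption made for illustration.
    """
    if not 0 <= addr < (1 << 12):
        raise ValueError("address must fit in 12 bits")
    return [(addr >> shift) & 0xF for shift in (0, 4, 8)]


def join_address(nibbles):
    """Rebuild the 12-bit address from three nibbles (low nibble first)."""
    addr = 0
    for position, nibble in enumerate(nibbles):
        addr |= (nibble & 0xF) << (4 * position)
    return addr


if __name__ == "__main__":
    nibbles = split_address(0xABC)
    print([hex(n) for n in nibbles])   # prints ['0xc', '0xb', '0xa']
    print(hex(join_address(nibbles)))  # prints 0xabc
```

The price of saving pins in this way is time: every address takes three transfers over the shared pins instead of one.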
For the majority of the three and a half decades since the 4004's introduction, increases in computational performance and throughput have been largely achieved by means of relatively obvious techniques as follows:
- Increasing the width of the data bus from 4 to 8 to 16 to 32 to the current 64 bits used in high-end processors.
- Adding (and then increasing the size of) local high-speed "cache" memory.
- Shrinking the size – and increasing the number – of transistors; today's high-end processors can contain hundreds of millions of transistors.
- Increasing the sophistication of processor architectures, including pipelining and adding specialized execution blocks, such as dedicated floating-point units.
- Increasing the sophistication of such things as branch prediction and speculative execution.
- Increasing the frequency of the system clock; today's high-end processors have core clock frequencies of 3 GHz (that's three billion clock cycles a second) and higher.
The problem is that these approaches can only go so far, with the result that traditional techniques for increasing computational performance and throughput are starting to run out of steam. When a conventional processor cannot meet the needs of a target application, it becomes necessary to evaluate alternative solutions such as multiple processors (in the form of chips or cores) and/or configurable processors (in the form of chips or cores).
The Computing Universe
For the purposes of this paper, we will consider the term computing in its most general sense; that is, we will understand "computing" to refer to the act of performing computations. There are many different types of computational tasks we might wish to perform, including – but not limited to – general-purpose office-automation applications (word-processing, spreadsheet manipulation, etc.); extremely large database manipulations such as performing a Google search; one-dimensional digital-signal processing (DSP) applications such as an audio codec; and two-dimensional DSP applications such as edge-detection in robotic vision systems.
In many cases, these different computational tasks are best addressed by a specific processing solution. For example, an FPGA may be configured (programmed) to perform certain DSP tasks very efficiently, but one typically wouldn't consider using one of these devices as the main processing element in a desktop computer. Similarly, off-the-shelf Intel and AMD processor chips are applicable to a wide variety of computing applications, but you wouldn't expect to find one powering a cell phone (apart from anything else, the battery life of the phone would be measured in seconds).
Fundamentally, there are three main approaches when it comes to performing computations. At one end of the spectrum we have a single, humongously large processor; at the other end of the spectrum we have a massively-parallel conglomeration of extremely fine-grained functions (which some may call "a great big pile of logic gates"); and in the middle we have a gray area involving multiple medium- and coarse-grained processing elements. (Note that this paper focuses on the microprocessor/CPU/DSP arenas; mainframe computers and supercomputers are outside the scope of these discussions.)
The classical processing solution for many applications is to use a single, humongously large "off-the-shelf" processor, such as a general-purpose CPU chip from Intel (www.intel.com) or AMD (www.amd.com) or a special-purpose DSP chip from Texas Instruments (www.ti.com). Similarly, in the case of embedded applications, one might choose to use a single general-purpose processor core from ARM or ARC or a DSP core from TI.
At some stage, a single processor simply cannot meet the needs of a target application, in which case it becomes necessary to evaluate alternative solutions as discussed in the following topics.
Coprocessors and Accelerators
If you are in the process of creating a new chip from the ground up, one technique is to augment a pre-defined processor core with one or more dedicated coprocessors and/or hardware accelerators. For example, CriticalBlue (www.criticalblue.com) has a tool called Cascade that accepts as input compiled applications (which may be referred to as binaries) in the form of executable ARM machine code. By means of a simple interface, the user selects which functions are to be accelerated, and Cascade then generates the register transfer level (RTL) description for a dedicated coprocessor (and the microcode to run on that coprocessor) to implement the selected functions.
A somewhat similar approach is that taken by Binachip (www.binachip.com), whose tools also take compiled (binary) programs as input. However, these tools first read the binary code into a neutral format, and then allow you to select which functions will be implemented in hardware and which are to be realized in software. Finally, they regenerate the binary code for the software portions of the system and generate register transfer level (RTL) representations for the accelerators used to implement the hardware portions of the system.
An alternative technique is that adopted by Poseidon Systems (www.poseidon-systems.com), whose Triton tool suite allows users to analyze ANSI standard C source code, to identify areas of the code to be accelerated, and to generate accelerators/coprocessors that can be used in conjunction with ARM, PowerPC, Nios, or MicroBlaze hard and soft processor cores implemented in SoCs and/or FPGAs.
And then there are the tools from Synfora (www.synfora.com), which can also analyze ANSI standard C source code and generate register transfer level (RTL) representations for corresponding hardware accelerators.
In reality, there are quite a few other players in this arena; these include (but are not limited to) Altera (www.altera.com) with its C2H (ANSI C to hardware accelerator) technology, Celoxica (www.celoxica.com) with its Agility Compiler (SystemC to hardware accelerator) and DK Suite (Handel-C to hardware accelerator) approaches, Forte Design Systems (www.forteds.com) with its Cynthesizer (SystemC/C++ to hardware accelerator) suite, and Mentor Graphics (www.mentor.com) with its Catapult BL and SL (C to hardware accelerator) technology.