Low power is not an attribute commonly associated with field-programmable gate array (FPGA) design. Yet the same design aspects that allow an FPGA to outperform a microprocessor also produce substantially reduced power-consumption numbers, even without resorting to tricks such as gated clocks. The secret to the magic is in the efficiencies gained by the low-level parallelism available in hardware designed for a specific task.
The vast majority of signal processor designs over the last quarter century have been approached using microprocessors specifically designed for signal-processing applications, usually referred to as digital signal processors, or DSPs. Since that term can also refer to machines that do not use a microprocessor at all, "digital-signal-processing microprocessor" or "DSP microprocessor" is a less ambiguous name.
But whatever they're called, their popularity stems from the extreme flexibility they offer and from the low per-unit costs afforded by large-scale production. Generally speaking, they consist of one or more arithmetic-logic units (ALUs) capable of basic arithmetic and logic functions including multiply-accumulates (MACs), an instruction decoder and some data path logic to move data between the ALU and memory. Instructions are sequentially fetched from a programmable instruction store and decoded to control the function of the ALU and the flow of data between the ALU and the data store. Complex algorithms are implemented by stringing together a sequence of these instructions to manipulate the data in the desired way.
This general-purpose structure allows the programmed instruction sequence, rather than the hardware architecture, to define the function. The result is a generic processor that can be mass-produced to fit a large variety of applications.
While these DSP microprocessors have special features to reduce control overhead and improve throughput for heavily arithmetic applications, they still process the data using a serial instruction stream. The data throughput is limited by the complexity of the algorithm and the instruction cycle time of the processor. Signal-processing applications typically require tens or even hundreds of instructions per data sample, so even with DSP microprocessors running at 200 MHz or more, the maximum data sample rate per processor is usually less than 10 megasamples/second. For example, each coefficient in a digital filter's equation requires a MAC, which in most cases constitutes one instruction cycle (some DSP processors, such as the TMS320C6x, are capable of two MACs per cycle). A modest filter with 20 unique coefficients needs 20 instruction cycles/sample to execute, not including any overhead. The only ways to improve performance for a given algorithm on a DSP microprocessor are to increase the clock rate or partition the algorithm over multiple DSP microprocessors.
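The serial cost described above can be sketched directly. The following Python model, with invented coefficients, shows why a direct-form filter needs one MAC per coefficient per sample on a serial processor:

```python
# Serial FIR filter: one multiply-accumulate (MAC) per coefficient per
# sample, mirroring how a DSP microprocessor iterates the computation.
# The coefficients and inputs below are illustrative only.

def fir_filter(samples, coeffs):
    """Direct-form FIR: each output sample costs len(coeffs) MACs."""
    delay_line = [0.0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the new sample into the delay line.
        delay_line = [x] + delay_line[:-1]
        # One MAC per coefficient -- with 20 coefficients, that is
        # 20 instruction cycles per sample, not counting overhead.
        acc = 0.0
        for c, d in zip(coeffs, delay_line):
            acc += c * d
        outputs.append(acc)
    return outputs
```

With 20 coefficients the inner loop runs 20 times per sample, which is exactly the instruction-cycle budget quoted in the text.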
Until fairly recently, the only alternatives to DSP microprocessors were offered by custom hardware in the form of an application-specific integrated circuit (ASIC), a function-specific board design constructed from discrete logic or a combination of the two. In these cases, the resulting fixed-function hardware is expensive to develop and is not easily changed. The high cost of custom hardware had kept these solutions out of the hands of all but the designers of high-volume and cost-doesn't-matter applications. FPGAs have changed this by offering a low-cost path to custom hardware, making efficient function-specific processors available to a much wider audience. The reconfigurability of many FPGAs also brings flexibility approaching that enjoyed by microprocessors to custom hardware.
Like microprocessors, FPGAs are mass-produced generic parts customized to a particular application by a program loaded by the end user. Rather than a set of sequential instructions, the FPGA program is a long string of static bits that control the function of hundreds or thousands of small logic blocks and those blocks' interconnections. The logic blocks are typically four-input binary lookup tables, usually with a capability for adding a flip-flop to the table output. The FPGA program sets the table values for each table, which in turn determines the Boolean function of that logic block. This structure provides a rich fabric of uncommitted logic resources usable as the building blocks for a custom hardware design.
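A minimal software model makes the lookup-table idea concrete. Here a 4-input logic block is just 16 configuration bits indexed by its inputs; the "program" is the table contents (the AND example is illustrative, not from the text):

```python
def make_lut(truth_table):
    """Model of a 4-input FPGA lookup table: 16 configuration bits
    fully define the Boolean function of the logic block."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        # The four inputs form a 4-bit index into the configuration bits.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return truth_table[index]
    return lut

# "Program" the block as a 4-input AND: only the all-ones entry is 1.
and4 = make_lut([0] * 15 + [1])
```

Changing the 16 table bits reprograms the block to any other 4-input Boolean function, which is all the configuration bitstream does, thousands of times over.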
Basically, a DSP microprocessor solution is constrained by the processor structure to iterating the partial results through the same hardware for each basic operation in an algorithm. Even though this uses the hardware efficiently and without sacrificing generality, it is expensive in terms of processing time. A custom hardware solution, whether in an FPGA or in some other form, presents the opportunity to unroll the algorithm so that each part of the process is done by dedicated hardware arranged in a pipeline, much like an assembly line. This yields three important benefits:
- First, since each part of the pipeline performs only one task, then passes the result to the next function in the chain, there is no waiting for hardware availability. A new sample can be processed as soon as the previous partial result is passed to the next pipeline stage, generally on every clock cycle regardless of the complexity of the algorithm. This clearly has a huge performance advantage over a microprocessor, even at greatly reduced clock rates.
- Second, since each stage in the pipeline is dedicated to a particular task, it can be optimized specifically for that task. There is no need to include the extra logic that would be required to control functionality in a general-purpose design. Likewise, the custom logic is not bound to a particular data word width: the designer is free to select the exact precision required at each stage in the process. The ability to define the function at the gate level rather than with the higher-level primitives represented by microprocessor instructions often reduces the logic even further. This hardware customization leads to greatly simplified logic, which in turn reduces the propagation delays and lowers the power consumption.
- Third, the output of each pipeline stage usually connects only to the input of the next stage. This not only eliminates shared data buses, control and storage, but also keeps the length and fanout of the connections between stages small. The highly localized wiring reduces propagation delays, wire capacitance and loading so the power and cycle times are both reduced.
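The pipeline behavior described in the list above can be sketched in software. The three stages here (scale, offset, clamp) are invented for illustration; the point is that every stage works on a different sample during the same clock cycle, so a new sample enters on every cycle:

```python
# Toy three-stage pipeline: each stage performs one fixed operation and
# passes its result to the next, so a new sample can enter every cycle.
# The stage functions are illustrative, not from the article.

def run_pipeline(samples):
    stages = [
        lambda v: v * 2,        # stage 1: fixed scaling
        lambda v: v + 1,        # stage 2: fixed offset
        lambda v: min(v, 100),  # stage 3: saturate/clamp
    ]
    regs = [None] * len(stages)  # pipeline registers between stages
    outputs = []
    for x in samples + [None] * len(stages):  # extra cycles flush the pipe
        out = regs[-1]
        if out is not None:
            outputs.append(out)
        # On each "clock", every stage consumes its predecessor's register,
        # so we update the registers from the back of the pipe forward.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](x) if x is not None else None
    return outputs
```

After the initial fill latency, one result emerges per cycle regardless of how many stages the algorithm was unrolled into.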
Further logic reductions are sometimes available by modifying the algorithm to obtain a more hardware-friendly implementation. This may be as simple as rearranging the order of operations, as is the case with distributed arithmetic. Distributed arithmetic is a bit-level rearrangement of a sum of products that hides the multiplications by constant coefficients, leaving small lookup tables and a tree of adders.
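A sketch of that rearrangement, assuming unsigned inputs for simplicity (real filters extend this to two's complement):

```python
# Distributed arithmetic: the sum of products sum(c[k] * x[k]) with
# constant coefficients is reorganized bit by bit.  A small lookup table,
# indexed by one bit from each input, holds every possible partial sum of
# the coefficients; shifts and adds replace the multipliers entirely.

def da_dot_product(xs, coeffs, bits=8):
    n = len(coeffs)
    # Table entry i is the sum of coefficients whose input bit is set
    # in i -- 2**n entries for n inputs, precomputed once.
    table = [sum(c for k, c in enumerate(coeffs) if i >> k & 1)
             for i in range(2 ** n)]
    acc = 0
    for b in range(bits):
        # Gather bit b of every input word to form the table index.
        index = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc += table[index] << b  # shift-and-add replaces the multiply
    return acc
```

In hardware the table is exactly the kind of small lookup table an FPGA logic block provides, and the shift-and-add loop becomes a tree of adders.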
Other times, a totally different approach to the problem can slash hardware. For example, the natural inclination for computing a vector magnitude is to use the root of the sum of squares, but that is prohibitively expensive in hardware. When 10 percent accuracy is good enough, the sum of the larger and half the smaller vector component magnitudes will yield a satisfactory result.
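The estimate costs one compare, one shift and one add in hardware. A sketch, with the caveat that the worst-case overestimate is about 11.8 percent (when the smaller component is half the larger), in line with the roughly 10 percent figure above:

```python
# "Larger plus half the smaller" magnitude estimate: replaces the
# square root and two multiplies of sqrt(x**2 + y**2) with a compare,
# a shift and an add.

def approx_magnitude(x, y):
    bigger = max(abs(x), abs(y))
    smaller = min(abs(x), abs(y))
    return bigger + smaller / 2  # in hardware: bigger + (smaller >> 1)
```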
Where more accuracy is needed, an algorithm that rotates the vector to a cardinal axis where the magnitude can be read directly uses a fraction of the logic of the root-sum-of-squares algorithm. The rotations are performed using a shift-add rotation algorithm known as Cordic. The Cordic solution has the added benefit of producing a measurement of the phase angle of the vector.
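A floating-point sketch of Cordic in vectoring mode follows; the 16-iteration count is an arbitrary choice here, and real hardware would use fixed-point adds and shifts rather than floats:

```python
import math

# Cordic, vectoring mode: repeated shift-and-add "micro-rotations" drive
# the vector onto the x-axis.  The final x, divided by the known Cordic
# gain, is the magnitude; the accumulated rotation is the phase angle.

def cordic_magnitude_phase(x, y, iterations=16):
    assert x > 0, "this sketch assumes the right half-plane"
    angle = 0.0
    for i in range(iterations):
        d = -1 if y > 0 else 1  # rotate toward the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * math.atan(2.0 ** -i)
    # Each micro-rotation stretches the vector by sqrt(1 + 2**(-2*i)),
    # so divide out the accumulated gain (about 1.6468).
    gain = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(iterations))
    return x / gain, angle
```

Because the per-iteration angles and the gain are constants, everything here reduces to shifts, adds and a final constant multiply: no multipliers, no square root, and the phase angle comes out for free.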
Less power per sample
The hardware efficiencies of a custom hardware design, whether in an FPGA or some other medium, obviously provide an opportunity for a significant performance advantage over DSP microprocessors. The power saved by the custom hardware may not be as apparent.
Consider what happens if we obtain our custom hardware pipeline simply by duplicating the processor's ALU for each instruction and arranging the copies in a pipeline. Since the power dissipation in modern CMOS logic is proportional to the number of gates switching per unit time, the dissipation per sample of the unrolled pipeline is about the same as that of the microprocessor. Although there are several times the number of gates, the number of logic transitions occurring per sample processed is unchanged.
That analysis, however, ignores the fact that the ALU controls, instruction logic and data path controls at each stage in the unrolled processor are static, since each stage performs a fixed function. These now-static controls represent a significant portion of the logic in the microprocessor. Eliminating the switching currents due to the controls, intermediate storage and shared data paths means the unrolled pipeline dissipates considerably less power per sample than the microprocessor.