Introduction
Digital signal processing (DSP) occurs in communications, audio, and multimedia devices, imaging and medical equipment, smart antennas, automotive electronics, MP3 players, radar and sonar, and barcode readers, to name but a few. According to market research firm Forward Concepts' November 1, 2005 DSP/Wireless Market Bulletin, the estimated market for programmable DSP chips should exceed $8B in 2005.
DSP platforms
DSP algorithms can be implemented in many different ways. The most popular platforms are as follows:
- General-purpose microprocessors (e.g. Pentium) and general-purpose microcontrollers (e.g. 8051) can run DSP algorithms of arbitrary complexity.
- Programmable DSP chips or DSP microprocessors (uP). The internal structure of DSP uPs is optimized to run many DSP algorithms much faster and more efficiently. For example, DSP chips have one or more built-in fast hardware multiplier-accumulator (MAC) units to perform the MAC operations that DSP algorithms use heavily.
- FPGA. An FPGA can be configured to run a particular DSP algorithm, thereby dedicating FPGA resources to particular DSP tasks. Also, the FPGA can run hundreds of MAC units in parallel. As a result, the performance may far exceed that of DSP uPs.
- ASIC. An ASIC offers even higher levels of "dedication" than the FPGA. ASICs are the champion when comparing performance per square millimeter of silicon. It is important to note, however, that the gap between the ASIC and the FPGA tends to narrow as the FPGA grows in size (e.g. larger than 1 million gates).
Parallel computing vs. Turing machines
Performance differs greatly among the various DSP platforms, depending on how each platform performs computations. Both general-purpose and specialized DSP processors belong to the class of Turing machines, which perform instructions one at a time. For example, in order to add two numbers A and B, the Turing machine would need to do something similar to the following:
- Fetch instruction 1 and decode it.
- Execute instruction 1, i.e. fetch data A and put it in the accumulator.
- Fetch instruction 2 and decode it.
- Execute instruction 2, i.e. fetch data B and add it to the accumulator.
- Fetch instruction 3 and decode it.
- Execute instruction 3, i.e. put the accumulated result where it needs to be.
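The six steps above can be sketched as a toy accumulator machine. This is only an illustration: the opcode names (LOAD, ADD, STORE), the memory layout, and the one-step-per-fetch and one-step-per-execute counting are assumptions made for the example and do not describe any particular processor.

```python
# Toy accumulator machine. Opcodes, memory layout and step counting
# are hypothetical and chosen only to mirror the six steps listed above.
def run(program, memory):
    acc = 0
    steps = 0
    for opcode, address in program:
        steps += 1                      # fetch and decode the instruction
        if opcode == "LOAD":            # put data in the accumulator
            acc = memory[address]
        elif opcode == "ADD":           # add data to the accumulator
            acc += memory[address]
        elif opcode == "STORE":         # put the result where it needs to be
            memory[address] = acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        steps += 1                      # execute the instruction
    return memory, steps


memory = {"A": 3, "B": 4, "SUM": 0}
program = [("LOAD", "A"), ("ADD", "B"), ("STORE", "SUM")]
memory, steps = run(program, memory)
print(memory["SUM"], steps)
```

Under these assumptions a single addition takes three instructions and six sequential steps, none of which overlap.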
The FPGA and ASIC are free of this limitation. In fact, a modern FPGA puts few flexibility and performance limitations on a system developer. The FPGA can run parallel processing (i.e. execute multiple operations at a time); implement Turing machine(s) as needed, including instantiation of soft microprocessor cores; and carry virtually any practical combination of parallel processors and Turing machines on the same silicon. Parallel processing dramatically improves the performance of common DSP functions, such as the FIR filter, FFT and correlator.
A 4-tap FIR filter structure is shown in Fig 1. While a processor needs to run the computations one by one, the FPGA instantiates all the necessary components – four multipliers, four adders, and three delay elements – and enables them to work in parallel. As a result, such a structure can process a new input sample every clock period as compared with the 8 clock cycles per data sample required by the microprocessor.
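For comparison, the sequential version of the same 4-tap filter can be sketched in software. The coefficient values and the function name below are made up for illustration, and the count assumes one multiply or one add per clock cycle, which is a simplification (a hardware MAC unit may do both in one cycle).

```python
# Sequential 4-tap FIR sketch. Coefficients are arbitrary example values,
# and "one operation per cycle" is an assumption, not a measured figure.
def fir_sequential(samples, coeffs):
    delay_line = [0] * len(coeffs)      # current sample plus delayed samples
    outputs = []
    ops_per_sample = 0
    for x in samples:
        delay_line = [x] + delay_line[:-1]   # shift the new sample in
        acc = 0
        ops = 0
        for c, d in zip(coeffs, delay_line):
            product = c * d             # one multiply
            acc += product              # one add
            ops += 2
        outputs.append(acc)
        ops_per_sample = ops
    return outputs, ops_per_sample


coeffs = [1, 2, 3, 4]
samples = [1, 0, 0, 0, 0]               # unit impulse
outputs, ops = fir_sequential(samples, coeffs)
print(outputs, ops)
```

With four taps the loop performs four multiplies and four adds per input sample, which is in line with the 8 cycles quoted for the microprocessor. In the FPGA structure the same eight operations are carried out by separate hardware elements at the same time.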
Not every DSP algorithm can efficiently utilize parallel processing. The IIR filter is one example. On the other hand, there are several techniques, such as CORDIC or error-correction algorithms, where FPGA technology, despite the limited application of parallel processing, has proven to be more efficient than a DSP processor.
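One commonly cited reason the IIR filter resists parallel processing is its feedback path: each output depends on a previous output. The first-order filter below is a minimal sketch of that dependency; the filter form y[n] = x[n] + a*y[n-1] and the coefficient value are chosen only for illustration.

```python
# First-order IIR sketch: y[n] = x[n] + a * y[n-1].
# The coefficient a = 0.5 is an arbitrary example value.
def iir_first_order(samples, a):
    y_prev = 0.0
    outputs = []
    for x in samples:
        y = x + a * y_prev      # cannot start until y_prev is known
        outputs.append(y)
        y_prev = y
    return outputs


response = iir_first_order([1.0, 0.0, 0.0, 0.0], 0.5)
print(response)
```

Unlike the FIR case, the multiply for sample n cannot begin before the result for sample n-1 is complete, so simply adding more multipliers does not shorten the loop.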
DSP and general-purpose processors are trying to catch up with parallel computation machines. In some cases, modern DSP processors can perform a few instructions at a time, for example with the help of HW co-processors or accelerators (e.g. Viterbi decoder, FFT engine). But the FPGA does not stand still either: it can carry "soft" processor cores, thereby enjoying all the benefits Turing machines provide.
Practical Considerations
Why would anyone buy a processor to run a DSP application if the FPGA is so much better? There are a few reasons:
First, processors have a longer history than FPGAs. The necessary support infrastructure has been developed over that time, including compilers, assemblers, automatic converters from high-level language to assembly code, and extensive libraries. At some point, almost every practical DSP application has been implemented on a microprocessor; so, theoretically, one could grab a ready off-the-shelf implementation. Additionally, many new DSP algorithms emerge as software routines, so they are pretty much ready for the processor platform.
Second, dealing with the FPGA requires a different set of skills than those common in the DSP community. When deciding which platform to choose, today's rule of thumb is: if a general-purpose processor can keep up with the spec, stay there. If not, pick the DSP processor when it meets the necessary MAC rate. Finally, if one needs outstanding performance, the FPGA is the best choice. Again, the challenge is that the number of SW development experts far exceeds the number of DSP processor programmers, which in turn is larger than the number of FPGA designers capable of implementing DSP algorithms. Finally, even with appropriate FPGA expertise available, creating a good DSP design that capitalizes on the FPGA's major benefits is a time-consuming and elaborate process.
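The rule of thumb above can be written down as a small decision function. This is a sketch only: the function name, its parameters, and the example MAC rates are hypothetical, and a real selection would also weigh cost, power, and available expertise.

```python
# Platform selection sketch based on required MAC rate.
# All names and numbers are hypothetical placeholders.
def choose_platform(required_macs_per_s, gpp_macs_per_s, dsp_macs_per_s):
    if required_macs_per_s <= gpp_macs_per_s:
        return "general-purpose processor"   # it keeps up with the spec
    if required_macs_per_s <= dsp_macs_per_s:
        return "DSP processor"               # it meets the MAC rate
    return "FPGA"                            # outstanding performance needed


GPP_RATE = 1e8   # assumed capacity, MACs per second
DSP_RATE = 1e9   # assumed capacity, MACs per second
choices = [choose_platform(rate, GPP_RATE, DSP_RATE)
           for rate in (5e7, 5e8, 5e10)]
print(choices)
```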
What about the ASIC? The FPGAs offer many of the same advantages as ASICs, such as reduction in size, weight and power dissipation; higher throughput; better design security against unauthorized copies; reduced device and inventory cost; and reduced board test cost.
ASICs lose to FPGAs in several respects: FPGAs reduce development time by a factor of three to four; their configuration can be modified, including by remote in-circuit programming; and they carry lower NRE costs, which a customer pays prior to obtaining an actual ASIC device.
Development Flows
Typical DSP development flow
A typical DSP development flow chart is shown in Fig 2. Usually, an algorithm makes its first appearance as a floating-point software model. The algorithm gets tested, evaluated and verified using an appropriate test bench. It is worth noting that test bench development often takes as much effort as the algorithm design, or even more. Floating-point representation lets algorithm creators take advantage of high-precision computations without worrying about dynamic range. At this stage, the algorithm is implementation independent.
As soon as the algorithm is found to be useful, it needs to be converted into a fixed-point representation. Implementing floating-point calculations directly, while possible in principle, either takes a huge amount of silicon resources or results in a painfully slow computation rate. This stage is implementation dependent. In many cases, the conversion is not a straightforward process. On the contrary, it may take several iterations and experiments to obtain acceptable results. Whoever does the conversion, whether a person or a group, needs good knowledge of the algorithm and reasonable familiarity with the implementation platform.
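One small part of such a conversion, quantizing floating-point filter coefficients, might look like the sketch below. The choice of a 16-bit Q15 format, the helper names, and the coefficient values are assumptions for illustration; the appropriate word length depends on the algorithm and the target platform.

```python
# Float-to-fixed sketch using an assumed Q15 format (1 sign bit,
# 15 fraction bits). Coefficients are arbitrary example values.
FRACTION_BITS = 15
SCALE = 1 << FRACTION_BITS              # 32768


def to_q15(value):
    """Round to the nearest Q15 integer and saturate to 16 bits."""
    q = int(round(value * SCALE))
    return max(-SCALE, min(SCALE - 1, q))


def from_q15(q):
    """Convert a Q15 integer back to a float for comparison."""
    return q / SCALE


float_coeffs = [0.1, 0.4, 0.4, 0.1]
fixed_coeffs = [to_q15(c) for c in float_coeffs]
max_error = max(abs(c - from_q15(q))
                for c, q in zip(float_coeffs, fixed_coeffs))
print(fixed_coeffs, max_error)
```

Comparing the fixed-point output against the floating-point model, as done here for the coefficients alone, is the kind of experiment that typically gets repeated with different word lengths until the error is acceptable.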
After the fixed-point algorithm gets verified, the actual implementation may start. Up to that point, the DSP architects use one of the common programming languages like C/C++, or an environment supporting a higher level of abstraction, like the MATLAB-Simulink package from MathWorks. In order to implement the algorithm on the FPGA, it needs to be handed over to the FPGA designers.