The implementation illustrated in Figure 3 is known as a multiply-and-accumulate, or MAC-type, implementation. This is almost certainly how a filter would be implemented in a classical DSP processor. The maximum performance of a 31-tap FIR filter implemented in this fashion in a typical DSP processor with a core clock rate of 1.2 GHz is about 9.68 MHz, that is, a maximum incoming data rate of 9.68 Msamples/s.
Figure 3 – MAC implementation in a classical DSP
An FPGA, on the other hand, offers many different implementation and optimization options. If a very resource-efficient implementation is desired, the MAC engine technique may prove ideal. A 31-tap filter serves as an example to illustrate the impact of filter specifications on the required logic resources. A block diagram of the implementation is shown in Figure 4.
Figure 4 – MAC engine FIR filter in an FPGA
Memory is required for data and coefficient storage. This may be a mixture of RAM and ROM internal to the FPGA. RAM holds the data samples and is organized as a cyclic buffer. The number of words is equal to the number of filter taps, and the bit width is set by the sample size. ROM holds the coefficients. In the worst case, the number of words is the same as the number of filter taps, but if the coefficients are symmetric, this may be reduced. The bit width must be large enough to support the largest coefficient. A full multiplier is required, since both the data sample and the coefficient change on every clock cycle. The accumulator adds the products as they are produced. The capture register is needed because the accumulator output changes on every clock cycle while the filter is working through the stored samples. Once a full set of N products has been accumulated, where N is the number of taps, the capture register latches the final result as the filter output.
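The data flow just described can be sketched as a small behavioral model. This is a minimal sketch, not HDL and not cycle-exact: the class name `MacEngineFir` and its attribute names are hypothetical, chosen only to mirror the blocks in the description (sample RAM, coefficient ROM, accumulator, capture register), and integer arithmetic of unlimited width is assumed.

```python
class MacEngineFir:
    """Behavioral model of a sequential MAC-engine FIR filter (illustrative only)."""

    def __init__(self, coefficients):
        self.coeff_rom = list(coefficients)      # coefficient ROM, one word per tap
        self.num_taps = len(self.coeff_rom)
        self.sample_ram = [0] * self.num_taps    # cyclic data buffer, one word per tap
        self.write_addr = 0                      # next RAM location to overwrite
        self.capture_reg = 0                     # holds the last completed result

    def push_sample(self, sample):
        # Overwrite the oldest sample in the cyclic buffer.
        self.sample_ram[self.write_addr] = sample
        newest = self.write_addr
        self.write_addr = (self.write_addr + 1) % self.num_taps

        # One multiply-accumulate per loop iteration; in hardware each
        # iteration would take one clock cycle, so N cycles per output.
        accumulator = 0
        for tap in range(self.num_taps):
            read_addr = (newest - tap) % self.num_taps  # walk back from newest sample
            accumulator += self.coeff_rom[tap] * self.sample_ram[read_addr]

        # The capture register latches the sum only once all N products are in.
        self.capture_reg = accumulator
        return self.capture_reg


if __name__ == "__main__":
    fir = MacEngineFir([1, 2, 3])
    # Feeding a unit impulse returns the coefficients in order.
    print([fir.push_sample(x) for x in [1, 0, 0, 0]])
```

The single multiplier and accumulator are reused for every tap, which is why the resource count stays small while the output rate drops as the number of taps grows.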
When used in MAC mode, the DSP48 is a perfect fit. The input registers, output registers and adder unit are all present in the DSP48 slice. The resources required for this 31-tap MAC engine implementation are one DSP48, one 18-kbit block RAM and nine logic slices. There are a few additional slices required for sample and coefficient address generation and control. If a 600-MHz clock were available in the FPGA, this filter could run at an input sample rate of 19.35 MHz (600 MHz divided by 31 taps), or 19.35 Msamples/s, in a -3 speed grade Xilinx® 7 series device.
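The arithmetic behind the 19.35-Msamples/s figure is simple enough to check by hand. The short sketch below assumes one multiply-accumulate per clock cycle for the FPGA MAC engine. The function name and the `cycles_per_tap` parameter are hypothetical; the parameter is there only to show that the 9.68-MHz DSP figure quoted earlier is numerically consistent with about four core clock cycles per tap, which is an inference from the numbers and not something stated in the text.

```python
def mac_engine_sample_rate(clock_hz, num_taps, cycles_per_tap=1):
    """Maximum input sample rate of a sequential MAC filter, in samples per second.

    Each output needs num_taps multiply-accumulates, and each one is assumed
    to cost cycles_per_tap clock cycles.
    """
    return clock_hz / (num_taps * cycles_per_tap)


if __name__ == "__main__":
    fpga_rate = mac_engine_sample_rate(600e6, 31)
    print(f"{fpga_rate / 1e6:.2f} Msamples/s")  # 19.35 Msamples/s
```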
If the system specification required a higher-performance FIR filter, a parallel structure could be implemented. Figure 5 shows a block diagram of a Direct Form Type I implementation.
Figure 5 – Direct Form I FIR filter in an FPGA
The Direct Form I filter structure provides the highest-performance implementation within an FPGA. This structure, which is also commonly referred to as a systolic FIR filter, uses pipelining and adder chains to extract maximum performance from the DSP48 slice. The input is fed into a cascade of registers that acts as the data sample buffer. Each register delivers a sample to a DSP48 slice, where it is multiplied by the respective coefficient. The adder chain holds the partial sums, which are successively combined from slice to slice to form the final result.
No external logic is required to support the filter, and the structure can be extended to support any number of coefficients. This structure can achieve maximum performance because there is no high-fanout input signal: the input drives only the first slice. A 31-tap FIR filter requires only 31 DSP48 slices. If a 600-MHz clock were available in the FPGA, this filter could run at an input sample rate of 600 MHz, or 600 Msamples/s, in a -3 speed grade 7 series device.
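To make the pipelining concrete, here is a minimal behavioral sketch of a systolic FIR, assuming a simplified stage with two data registers and one adder-chain register. The class name `SystolicFir` and the register names `d1`, `d2` and `acc` are hypothetical, and the register arrangement is a simplification that does not reproduce the exact pipeline of a real DSP48 slice. In this model the result for a given input sample appears N + 1 clocks later, where N is the number of taps.

```python
class SystolicFir:
    """Simplified clock-by-clock model of a systolic FIR filter (illustrative only)."""

    def __init__(self, coefficients):
        self.coeffs = list(coefficients)
        n = len(self.coeffs)
        self.d1 = [0] * n      # first data register of each stage
        self.d2 = [0] * n      # second data register of each stage
        self.acc = [0] * n     # adder-chain register of each stage
        self.latency = n + 1   # clocks from an input sample to its output (this model)

    def clock(self, x_in):
        """Advance one clock: accept one sample, return the last stage's register."""
        n = len(self.coeffs)
        # Compute every next-state value from the current state, as hardware
        # registers would. Data moves two registers per stage while sums move
        # one, which lines up each sample with the correct partial sum.
        next_d1 = [x_in] + [self.d2[k - 1] for k in range(1, n)]
        next_d2 = list(self.d1)
        next_acc = [
            self.coeffs[k] * self.d2[k] + (self.acc[k - 1] if k > 0 else 0)
            for k in range(n)
        ]
        self.d1, self.d2, self.acc = next_d1, next_d2, next_acc
        return self.acc[-1]


if __name__ == "__main__":
    fir = SystolicFir([1, 2, 3])
    outputs = [fir.clock(x) for x in [1, 0, 0, 0, 0, 0, 0, 0]]
    # After the pipeline latency, the impulse response equals the coefficients.
    print(outputs[fir.latency:])
```

Note that `x_in` is connected only to the first stage, and every stage talks only to its neighbor. That is the property the text credits for the high clock rate: one new sample enters and one result leaves on every clock, regardless of the number of taps.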
From this example, you can clearly see that the FPGA not only significantly outperforms a classic digital signal processor (600 Msamples/s versus 9.68 Msamples/s for the same 31-tap filter), but it does so at a much lower clock rate (600 MHz versus 1.2 GHz) and therefore with lower power consumption.
This example illustrates only a couple of implementation techniques for FIR filters in an FPGA. The implementation may be further tailored to take advantage of data sample rate specifications that fall between the extremes of sequential MAC operation and fully parallel operation. You may also consider additional trade-offs between performance and resource utilization involving symmetric coefficients, interpolation, decimation, multiple channels or multirate operation. The Xilinx CORE Generator™ or System Generator utilities will help you exploit all of these design variables and techniques.
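As one illustration of the symmetric-coefficient trade-off, the sketch below shows how a filter whose coefficients mirror around the center can add each pair of samples first and then multiply once per pair. This is a minimal sketch under stated assumptions: the function name and argument layout are hypothetical, and it computes a single output from a window of the most recent samples. For a symmetric 31-tap filter, this reduces the multiplications per output from 31 to 16.

```python
def symmetric_fir_output(half_coeffs, window):
    """Compute one FIR output for symmetric coefficients.

    window      : the N most recent samples, newest first
    half_coeffs : the first ceil(N / 2) coefficients; the rest mirror them
    """
    n = len(window)
    total = 0
    for k in range(n // 2):
        # Pre-add the two samples that share a coefficient, then multiply once.
        total += half_coeffs[k] * (window[k] + window[n - 1 - k])
    if n % 2:
        # An odd-length filter has an unpaired center tap.
        total += half_coeffs[n // 2] * window[n // 2]
    return total


if __name__ == "__main__":
    full_coeffs = [1, 2, 3, 2, 1]
    window = [5, 4, 3, 2, 1]
    folded = symmetric_fir_output(full_coeffs[:3], window)
    direct = sum(c * x for c, x in zip(full_coeffs, window))
    print(folded, direct)  # 27 27
```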