# FPGA-based FFT engine handles four times more input data

*Rodger Hosking, Vice President, Richard Kuenzler, Engineer, Pentek Inc., Upper Saddle River, N.J.*

For years, field-programmable gate array (FPGA) technology has been a cornerstone of board-level product design for embedded software radio and communication systems. FPGAs are ideal for implementing the data formatting, timing, and specialized glue logic needed to connect real-time peripherals such as modems, A/D converters, and digital receivers to programmable processors. However, with their newly acquired digital signal processing (DSP) capabilities, FPGAs are now expanding beyond these traditional roles to help offload computationally intensive DSP functions from the processor.

One of the classic algorithms for DSP benchmarking, the fast Fourier transform (FFT) is deployed in a wide range of communications, radar, and signal intelligence applications. One of the most efficient methods of performing the FFT calculation is to iterate the radix-4 "butterfly" algorithm. Each butterfly consists of multipliers and adders that accept four input points and compute four output points based on suitably chosen coefficients from a sine table. A 4,096-point FFT requires six stages of butterflies, because 4,096 is 4 raised to the sixth power, representing a total of 60 multipliers. With 96 multipliers available simultaneously, some FPGAs can perform all 60 multiplication operations in parallel. This illustrates the fundamental performance advantage that FPGAs offer for this type of calculation over even the latest general-purpose processors, in which two or four multipliers must be time-shared.
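
To make the stage count and the butterfly structure concrete, here is a minimal software sketch of a radix-4 decimation-in-time FFT. It is an illustration only, not the FPGA design: the function names (`radix4_stages`, `radix4_butterfly`, `fft_radix4`, `naive_dft`) are hypothetical, the recursion stands in for hardware stages, and floating-point arithmetic replaces the integer arithmetic an FPGA would use.

```python
import cmath


def radix4_stages(n):
    """Number of radix-4 stages for an n-point FFT (n must be a power of 4)."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("length must be a power of 4")
        n //= 4
        stages += 1
    return stages


def radix4_butterfly(a, b, c, d):
    """4-point DFT: only additions and multiplications by +/-1 and +/-j."""
    t0, t1 = a + c, a - c
    t2, t3 = b + d, -1j * (b - d)
    return t0 + t2, t1 + t3, t0 - t2, t1 - t3


def fft_radix4(x):
    """Recursive radix-4 FFT (illustrative sketch, not the FPGA pipeline)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 4:
        raise ValueError("length must be a power of 4")
    q = n // 4
    # Four interleaved quarter-length sub-transforms.
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for k in range(q):
        # Twiddle factors play the role of the sine-table coefficients.
        w1 = cmath.exp(-2j * cmath.pi * k / n)
        w2, w3 = w1 * w1, w1 * w1 * w1
        y = radix4_butterfly(subs[0][k], w1 * subs[1][k],
                             w2 * subs[2][k], w3 * subs[3][k])
        for m in range(4):
            out[k + m * q] = y[m]
    return out


def naive_dft(x):
    """Direct O(n^2) DFT, used only to check the fast version."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    print("stages for 4,096 points:", radix4_stages(4096))
    data = [complex(i % 5, -(i % 3)) for i in range(64)]
    err = max(abs(a - b) for a, b in zip(fft_radix4(data), naive_dft(data)))
    print("max deviation from direct DFT:", err)
```

Each butterfly in this sketch needs three complex twiddle multiplications. If each of those is built from four real multipliers, and the one stage whose twiddles are all trivial needs none, five stages of 12 multipliers would give 60. That accounting is an inference that happens to match the figure above, not a detail the text confirms.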

Because the FFT is inherently a block-oriented algorithm, it operates most efficiently when a freely addressable RAM provides quick access to all input and output samples. However, this ideal model of random data availability is at odds with the sequential input samples streaming from the A/D converter. A proprietary memory structure, implemented by configuring the block RAM resources of the FPGA, provides four input-data memory ports that feed the butterfly engine in parallel, thus solving the data availability problem. This unique memory architecture allows subsequent input blocks to be processed in a continuous, systolic manner so that all of the multipliers in all six stages can be productively engaged all the time.

Each radix-4 butterfly operates on four input samples within one clock cycle, whereas an A/D converter delivers only one sample per clock. Therefore, when the FPGA processing clock is equal to the A/D clock, the architecture can run four times faster than real time. With suitable hardware multiplexing schemes, this same engine can handle four streams of input data instead of just one.

For example, in a dual-channel software-radio mezzanine module attached to a multiprocessor VME board, two 12-bit A/D converters digitize RF signals at sampling rates up to 100 MHz. The FPGA receives real outputs from both A/D converters as well as complex baseband outputs from both digital receivers. The FPGA implements a dual velocity interface mezzanine (VIM) port that delivers streaming data at up to 400 Mbytes/second directly into each DSP or PowerPC on the multiprocessor platform. FIFO memories buffer the data to support efficient DMA transfers. In this case, we used the Virtex-II family of FPGAs from Xilinx Inc. (San Jose, Calif.) to extend the capability of this product to handle DSP algorithms. Specifically, the XC2V3000, with 96 dedicated 18x18 multiplier blocks and over 200 kbytes of block RAM, supports quite substantial signal processing tasks. The FPGA still performs all the conventional tasks of timing, formatting, and glue logic for the various devices on board. Yet, after these standard features are incorporated, approximately 94 percent of the logic and all of the multipliers remain available for additional functions.

**Faster execution**

In our example, with two A/D converters operating at 100 MHz, the FPGA is working at only half capacity. With a little extra effort, however, the engine can be set to process both channels with 50 percent input overlap, which fully uses the hardware. In this case, the pipelined execution time is 10.24 microseconds for each FFT. This is one-quarter of the 40.96 microseconds it takes to collect the 4,096 input points at a 100-MHz sampling rate, consistent with performing four FFTs in real time. In fact, this execution time is less than one-tenth that of an optimized FFT algorithm running on a 400-MHz G4 PowerPC.
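
The timing figures in this section follow from simple arithmetic, sketched below. The function names and the `utilization` measure are hypothetical helpers introduced for illustration. The sketch assumes the engine accepts four samples per clock and is clocked at the A/D rate, as described earlier.

```python
def execution_time_s(n_points=4096, clock_hz=100e6, samples_per_clock=4):
    """Time for the pipelined engine to accept one block of n_points."""
    return n_points / (samples_per_clock * clock_hz)


def collection_time_s(n_points=4096, sample_rate_hz=100e6):
    """Time for one A/D converter to deliver n_points samples."""
    return n_points / sample_rate_hz


def required_fft_period_s(n_points=4096, sample_rate_hz=100e6,
                          channels=2, overlap=0.5):
    """Average time between FFT starts needed to keep up with the input."""
    new_samples_per_fft = n_points * (1.0 - overlap)
    return new_samples_per_fft / sample_rate_hz / channels


def utilization(channels=2, overlap=0.5):
    """Fraction of the engine's capacity consumed by the given workload."""
    return execution_time_s() / required_fft_period_s(channels=channels,
                                                      overlap=overlap)


if __name__ == "__main__":
    print("collect one block (us):", collection_time_s() * 1e6)
    print("execute one FFT (us):  ", execution_time_s() * 1e6)
    print("two channels, no overlap:", utilization(2, 0.0))
    print("two channels, 50% overlap:", utilization(2, 0.5))
```

With two channels and no overlap the engine needs to start an FFT only every 20.48 microseconds, which is half its capacity. Adding 50 percent overlap doubles the number of FFTs and brings the required period down to the 10.24-microsecond execution time.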

Because only 60 of the 96 multipliers were used for the FFT algorithm, the remaining multipliers were available for additional features. At each of the four complex input streams, an optional Hanning window can be applied, consuming an extra eight multipliers. Because the coefficients for the FFT and for the Hanning window are read from separate FPGA table memories, alternative input windowing functions can be substituted for the Hanning window.
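
A table-driven window of this kind can be sketched in a few lines. The names `hann_window` and `apply_window` are hypothetical, and the periodic form of the Hanning (Hann) formula is an assumption, since the text does not say which form the hardware table holds.

```python
import math


def hann_window(n):
    """Periodic Hann coefficients: 0.5 - 0.5*cos(2*pi*i/n) (assumed form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def apply_window(samples, table):
    """Scale each complex sample by the real coefficient at the same index.

    One complex sample times one real coefficient costs two real
    multiplications (real part and imaginary part).
    """
    if len(samples) != len(table):
        raise ValueError("window table must match the block length")
    return [complex(s.real * w, s.imag * w) for s, w in zip(samples, table)]


if __name__ == "__main__":
    block = [complex(1.0, -1.0)] * 8
    print(apply_window(block, hann_window(8)))
    # Substituting another window only means loading a different table,
    # for example a rectangular window of all ones:
    print(apply_window(block, [1.0] * 8))
```

Because the window is just a lookup table, swapping it for another function changes the table contents and nothing else. The two real multiplications per complex sample, applied to four streams, are consistent with the eight multipliers cited above, although that breakdown is an inference rather than a stated design detail.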

Eight more multipliers are used to perform an optional power calculation at the FFT output, in which the real and imaginary components of each of the four outputs are squared and then added together. Finally, an averager stage adds the two outputs of the 50 percent input overlap FFTs to improve signal-to-noise characteristics. At the output of the FPGA, a multiplexer allows the results of each signal processing stage to be directed to the processor interface.
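
The post-processing chain described above can be sketched as follows. All names are hypothetical. The sketch assumes the averager operates on the power outputs and divides the sum by two; the text says only that the two overlapped outputs are added, so both choices are assumptions.

```python
def overlapped_blocks(samples, n, overlap=0.5):
    """Split a sample stream into length-n blocks that overlap by `overlap`."""
    hop = int(n * (1.0 - overlap))
    if hop < 1:
        raise ValueError("overlap leaves no new samples per block")
    return [samples[s:s + n] for s in range(0, len(samples) - n + 1, hop)]


def power_spectrum(bins):
    """Square the real and imaginary parts of each bin and add them."""
    return [b.real * b.real + b.imag * b.imag for b in bins]


def average_pair(power_a, power_b):
    """Combine two overlapped results bin by bin (sum halved, by assumption)."""
    if len(power_a) != len(power_b):
        raise ValueError("spectra must have the same length")
    return [(a + b) / 2.0 for a, b in zip(power_a, power_b)]


if __name__ == "__main__":
    # Block boundaries for an 8-sample stream and 4-point blocks.
    print(overlapped_blocks(list(range(8)), 4))
    # Power of two made-up FFT outputs, then their average.
    first = power_spectrum([3 + 4j, 1j])
    second = power_spectrum([1 + 0j, 2j])
    print(average_pair(first, second))
```

Each power result needs two squarings, so four parallel outputs account for the eight multipliers mentioned in the text.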

Several proprietary techniques were also employed to reduce the rounding and truncation errors of integer arithmetic, yielding a calculation dynamic range of better than 90 dB. With all of these features in place and the design optimized for the available FPGA resources, the final design consumed 76 of the 96 multipliers plus a large percentage of the logic and memory resources of the XC2V3000 device.
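
As a quick cross-check, the multiplier counts quoted in this section add up as follows. The dictionary keys are hypothetical labels; the numbers are the ones given in the text.

```python
TOTAL_MULTIPLIERS = 96  # 18x18 multiplier blocks in the XC2V3000

# Multiplier counts quoted in the text for each function.
budget = {
    "fft_butterflies": 60,
    "input_window": 8,
    "power_calculation": 8,
}


def multipliers_used(allocation):
    """Total multipliers consumed by all listed functions."""
    return sum(allocation.values())


def multipliers_free(allocation, total=TOTAL_MULTIPLIERS):
    """Multipliers left over for other functions."""
    return total - multipliers_used(allocation)


if __name__ == "__main__":
    print("used:", multipliers_used(budget))
    print("free:", multipliers_free(budget))
```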

Although this FPGA is still expensive because of its recent introduction, two concentric subsets of its ball-grid array footprint pattern accommodate two smaller devices in the same family, to save costs for less demanding applications.

We believe this FPGA-based FFT engine, with its tenfold speed advantage over programmable processors, offers a very powerful front end for many signal-processing systems. This strategy of offloading well-defined, CPU-intensive tasks from programmable processors will become increasingly common as these new DSP-capable FPGAs find their way into board-level products.