# How to build ultra-fast floating-point FFTs in FPGAs

Engineers targeting DSP designs to FPGAs have traditionally used fixed-point arithmetic, mainly because of the high cost of implementing floating-point arithmetic. That cost comes in the form of increased circuit complexity and, often, degraded maximum clock performance. Certain applications demand the dynamic range offered by floating-point hardware but require speeds and circuit sizes usually associated with fixed-point hardware. The fast Fourier transform (FFT) is one DSP building block that frequently requires both floating-point dynamic range and high speed.

A textbook construction of a pipelined floating-point FFT engine capable of continuous input entails dozens of floating-point adders and multipliers. The complexity of these circuits quickly exceeds the resources available on a single FPGA. We fit the FFT design into a single FPGA without sacrificing speed or floating-point performance by using an alternative FFT algorithm and a hybrid of fixed- and floating-point hardware.

The resulting design has IEEE single-precision floating-point inputs and outputs that match the precision obtained with more conventional designs, yet is capable of as much as 1.2 gigasamples-per-second continuous data throughput and fits into one Xilinx Virtex-4 XC4VSX55 FPGA. The design performs 32-, 64-, 128-, 256-, 512-, 1,024-, or 2,048-point complex-input Fourier transforms, with the size selectable on the fly.

**The Algorithm**

The FFT can be factored in a variety of ways, and each factorization results in a different algorithm. The most common is the Cooley-Tukey algorithm, which recursively factors each N-point transform into a pair of N/2-point transforms combined by a "butterfly" operation until the reduced transforms are each a single sample long. Each butterfly operation comprises a complex multiply by a phasor or "twiddle factor" (that is, a phase rotation of the input) followed by a two-point FFT, which is just a complex sum and difference. The regularity of the algorithm and its data sequencing is ideal for a software implementation. The number of multiplies and floating-point operations is of little consequence in software because all of the operations are performed serially by a single floating-point unit.
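To make the recursion concrete, here is a minimal software sketch of the radix-2 Cooley-Tukey algorithm described above; the function name is illustrative, and this is of course not the hardware design discussed in this article:

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Recursively factor the N-point transform into two N/2-point transforms
    # over the even- and odd-indexed samples.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: twiddle-factor rotation, then a two-point FFT
        # (a complex sum and difference).
        rotated = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + rotated
        out[k + n // 2] = even[k] - rotated
    return out
```

For example, `fft_radix2([1, 2, 3, 4])` returns the four-point transform `[10, -2+2j, -2, -2-2j]` (to within rounding).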

Other factorizations of the Fourier transform can result in a more efficient hardware design. The Winograd FFT algorithm is particularly interesting for hardware implementations because it is a factorization designed to minimize the number of multiplies needed to perform the transform. The price you pay for the reduced number of operations is a highly irregular addressing sequence, which makes it very inefficient to perform with a microprocessor.

A hardware solution using distributed memory mostly avoids the problems posed by an irregular address sequence. A 16-point Winograd FFT decomposes into three layers of add/subtract operations on each side of a real multiplier (Figure 1), and performs 74 add and 18 multiply operations per FFT. This represents about one third of the hardware of a Cooley-Tukey implementation, which requires 176 add and 72 multiply operations. Andraka Consulting Group has developed optimized fixed-point Winograd kernels for maximum-performance 4-, 8-, and 16-point FFTs targeted at all Virtex families. This particular design uses a redesign of our Winograd kernel that takes advantage of the DSP48 primitives; is capable of selectable 1-, 4-, 8-, and 16-point operation; and can operate at clock rates as high as the maximum clock rate of the DSP48 slice (400 MHz in the -10 speed grade device).
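To illustrate the add/subtract structure of such kernels, here is a sketch of the trivial 4-point case, where the multiply stage degenerates entirely to sign swaps (multiplying by -j is just a swap and negate, not a real multiply). The Python formulation and function name are ours, purely for illustration; the actual kernels are fixed-point hardware:

```python
def fft4_winograd_style(x):
    """4-point DFT using only add/subtract layers; zero real multiplies."""
    a, b, c, d = (complex(v) for v in x)
    # First add/subtract layer.
    s0, s1 = a + c, a - c
    s2, s3 = b + d, b - d
    # "Multiply" stage: (re + j*im) * -j = im - j*re -- a swap and negate.
    t = complex(s3.imag, -s3.real)
    # Final add/subtract layer.
    return [s0 + s2, s1 + t, s0 - s2, s1 - t]
```

Larger kernels such as the 16-point Winograd FFT keep this layered shape but require a small number of genuine real multiplies between the add/subtract layers.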


Figure 1 – A 16-point Winograd FFT.

The Winograd FFT is awkward to factor for sizes larger than 16 points, so rather than attempting to compute the full FFT directly, we use the mixed-radix algorithm to construct larger FFTs from small 4-, 8-, and 16-point Winograd FFT kernels. The mixed-radix algorithm arranges the data into a matrix, performs a small FFT down each column, phase-rotates the results, and then performs an additional FFT across each row. The 32- to 2,048-point design uses the mixed-radix algorithm to obtain a 32-, 64-, 128-, or 256-point FFT from a pair of 4/8/16 kernels, and then applies the algorithm a second time to combine that composite FFT with an additional 8-point kernel to obtain the 512-, 1,024-, and 2,048-point FFTs (Figure 2). The design is easily extended to 4,096 points by using a 16-point kernel for the extra stage, and may be extended to still larger FFTs using additional stages, rotators, and reorder memory.
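A software sketch may help clarify the column-FFT / rotate / row-FFT structure of the mixed-radix combination. In this sketch a direct DFT stands in for the small Winograd kernels, and the function names are illustrative only:

```python
import cmath

def dft(x):
    """Direct DFT; stands in here for a small 4/8/16-point Winograd kernel."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_mixed_radix(x, n1, n2):
    """N = n1*n2 point FFT built from n1- and n2-point kernels."""
    n = n1 * n2
    assert len(x) == n
    # Arrange the data into an n1 x n2 matrix (stride-n2 columns) and
    # perform a small n1-point FFT down each column.
    cols = [dft([x[n2 * i1 + i2] for i1 in range(n1)]) for i2 in range(n2)]
    # Phase-rotate the results by the twiddle factors W_N^(i2*k1).
    for i2 in range(n2):
        for k1 in range(n1):
            cols[i2][k1] *= cmath.exp(-2j * cmath.pi * i2 * k1 / n)
    # Perform an additional n2-point FFT across each row.
    rows = [dft([cols[i2][k1] for i2 in range(n2)]) for k1 in range(n1)]
    # Read out with output index k = k1 + n1*k2.
    return [rows[k % n1][k // n1] for k in range(n)]
```

For instance, `fft_mixed_radix(x, 4, 8)` computes a 32-point FFT from 4- and 8-point kernels, matching the direct 32-point DFT to within rounding.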


Figure 2 – Combination of small Winograd kernels using mixed-radix algorithm to achieve larger FFTs.