# Using parallel FFT for multi-gigahertz FPGA signal processing

While there are a number of ways to implement FFTs, a parallelized version of the Radix2 Multi-Path Delay Commutator kernel (Radix2-MDC) [**3**] works very well as a modular building block for configurable parallel-FFT cores that scale well in advanced FPGA devices. The Radix2-MDC is a classical approach to building pipelined FFTs of varying lengths, as shown in *Figure 2a* for a 16-point FFT. It splits the input sequence into two parallel data streams that flow forward through the pipeline; delay elements schedule the streams so that the data elements entering each butterfly (the basic computational element of FFT algorithms) are the correct “distance” apart. The Radix2-MDC is relatively easy to parallelize using a wider data path and vector operations, as shown in *Figure 2b*. MDC structures also lend themselves easily to flow control and dynamic length reconfiguration, as opposed to single-path delay feedback (SDF) structures, where the incorporation of flow control (stall) signals typically reduces maximum throughput considerably.

**Figure 2: The Radix2-MDC kernel (a, at top) can effectively be parallelized and used in a modular way to create parallel-FFT implementations (b)**
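To make the butterfly “distance” concrete, here is a minimal software sketch of a radix-2 decimation-in-frequency FFT computed one stage at a time. It is not the hardware design from the article and it does not model the delay lines or commutators themselves; it only shows, under that simplification, how the spacing between the two butterfly inputs halves from stage to stage, which is the spacing an MDC pipeline has to produce with its delays. The function names `fft_dif_stages`, `dft_reference`, and `bit_reverse` are hypothetical, chosen for illustration.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_dif_stages(x):
    """Radix-2 DIF FFT, stage by stage.

    Returns the spectrum in natural order plus the butterfly distance
    used in each stage (illustrative model, not the article's RTL).
    """
    n = len(x)
    stages = n.bit_length() - 1
    assert n == 1 << stages, "length must be a power of two"
    data = [complex(v) for v in x]
    distances = []
    dist = n // 2
    while dist >= 1:
        distances.append(dist)
        for start in range(0, n, 2 * dist):
            for k in range(dist):
                a = data[start + k]
                b = data[start + k + dist]  # partner is `dist` samples away
                w = cmath.exp(-2j * cmath.pi * k / (2 * dist))  # twiddle factor
                data[start + k] = a + b
                data[start + k + dist] = (a - b) * w
        dist //= 2
    # A DIF flow graph leaves its results in bit-reversed order.
    spectrum = [data[bit_reverse(i, stages)] for i in range(n)]
    return spectrum, distances


def dft_reference(x):
    """Direct O(N^2) DFT, used only to check the staged version."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    _, distances = fft_dif_stages(range(16))
    print(distances)  # [8, 4, 2, 1]
```

For the 16-point case of Figure 2a, the four stages pair samples that are 8, 4, 2, and 1 positions apart. In a streaming implementation those distances become delay lengths rather than array offsets.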

Another choice that can affect scalability is the complex-multiplier implementation, that is, either a 4-multiply (4M) or a 3-multiply (3M) structure. Choosing a 3M complex multiplier often leads to lower area usage in your design, but at the expense of slower clock speeds [**4**]. This trade-off is also very dependent on the DSP hardware of the FPGA device. Below are the most important parameters and the choices we made in the case study that we are about to present:

• Length = 1024

• Input precision = 16 bits

• Radix2-MDC architecture using 4Mult-5add complex multipliers

• Data path precision = 1-bit growth per stage (10 stages, hence 10 bits of growth, for L = 1024)

• Dynamic length programmability included

• Optional flow control and synchronization turned on
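Two of these choices can be illustrated with a short sketch. The first part shows the arithmetic behind the 4M and 3M complex-multiplier structures: both compute the same product, and the 3M form trades one real multiply for additional adders. The second part works out the data path width after each stage, assuming the 16-bit input and 1-bit growth per stage listed above. The function names `cmul_4m`, `cmul_3m`, and `stage_widths` are hypothetical, and the code models only the arithmetic, not the DSP-block mapping or timing on any particular device.

```python
def cmul_4m(a, b, c, d):
    """(a + jb) * (c + jd) using four real multiplies."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag


def cmul_3m(a, b, c, d):
    """Same product using three real multiplies and extra additions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    real = k1 - k3  # = a*c - b*d
    imag = k1 + k2  # = a*d + b*c
    return real, imag


def stage_widths(input_bits, fft_length, growth_per_stage=1):
    """Word width after each radix-2 stage (assumes fixed growth per stage)."""
    stages = fft_length.bit_length() - 1
    assert fft_length == 1 << stages, "length must be a power of two"
    return [input_bits + growth_per_stage * (s + 1) for s in range(stages)]


if __name__ == "__main__":
    print(cmul_4m(3, -2, 5, 7))       # (29, 11)
    print(cmul_3m(3, -2, 5, 7))       # (29, 11)
    print(stage_widths(16, 1024)[-1])  # 26
```

Under these assumptions, a 16-bit input grows to 26 bits at the output of the tenth stage of a 1024-point transform.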