# Using parallel FFT for multi-gigahertz FPGA signal processing

High-speed fast Fourier transform (FFT) cores are essential to any real-time spectral-monitoring system. As the demand for monitoring bandwidth grows apace with the proliferation of wireless devices across the spectrum, these systems must convert time-domain samples to the frequency domain ever more rapidly, necessitating faster FFT operations. Indeed, most modern monitoring systems must use parallel FFTs to sustain sample throughputs several times the highest clock rate achievable in state-of-the-art FPGAs such as the Xilinx Virtex-7, taking advantage of wideband A/D converters that can easily reach sample rates of 12.5 Gsamples/s and more. [**1**]

At the same time, as communications protocols become increasingly packetized, the duty cycles of signals that need to be monitored are decreasing. This phenomenon requires a dramatic decrease in scan repeat time, which necessitates low-latency FFT cores. Parallel FFTs can help in this regard as well, since the latency scales down almost proportionally to the ratio of sample rate to clock speed.

For all of these reasons, let’s delve into the design of a parallel FFT (PFFT) with runtime-configurable transform length, taking note of the throughput and utilization numbers that are achievable when using parallel FFT.

**Hardware parallelism for FFTs**

Due to the complexity of implementing FFTs directly in logic, many hardware designers use off-the-shelf FFT cores from various vendors. [**2**] However, most off-the-shelf FFT cores use “streaming” or “block” architectures that process at most one sample per clock, which limits throughput to the maximum clock speed achievable by the FPGA or ASIC device. A PFFT offers a faster alternative: it accepts multiple samples per clock and processes them in parallel to deliver multiple output samples per clock. This architecture multiplies throughput beyond the achievable device clock speed, but at an additional cost in area and complexity. Thus, using a PFFT means trading throughput against area. The trade-offs for a typical Virtex-7 FPGA design are outlined in **Figure 1** and **Table 1**.
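To make the parallel split concrete, here is a minimal NumPy sketch (illustrative only, not vendor core code) of the decimation-in-time decomposition that underlies a PFFT: P lanes each transform one polyphase branch of the input with an N/P-point FFT, and a twiddle-multiply stage recombines the lane outputs into the full N-point result. In hardware the P lanes run concurrently, one branch per clock lane.

```python
import numpy as np

def parallel_fft(x, P):
    """N-point FFT built from P smaller FFTs (decimation in time),
    mirroring how a hardware PFFT splits work across P parallel lanes.
    Requires len(x) divisible by P."""
    N = len(x)
    M = N // P
    # Each "lane" transforms one polyphase branch x[p], x[p+P], x[p+2P], ...
    lanes = np.array([np.fft.fft(x[p::P]) for p in range(P)])  # shape (P, M)
    k = np.arange(N)
    # Twiddle factors W_N^{p*k} applied before recombination.
    tw = np.exp(-2j * np.pi * np.outer(np.arange(P), k) / N)   # shape (P, N)
    # X[k] = sum over lanes p of W_N^{p*k} * lane_p[k mod M]
    return (tw * lanes[:, k % M]).sum(axis=0)
```

The recombination stage is where the extra multipliers of Table 1 are spent: each added lane brings another bank of twiddle multiplies alongside its sub-FFT.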

*Figure 1: A parallel FFT processes multiple samples at a time to scale throughput beyond the achievable system clocks of the target device. Optional features include flow control, synchronization and dynamic length programmability.*

**Table 1 – Area scalability is generalized by hardware multiplier utilization. Throughput scalability vs. area is slightly better than linear and generally very usable for increasing throughput to multi-gigahertz sample rates.**

Looking at the table, a few general features can be seen in the trade-off curve:

**1.** As parallel throughput increases, multiplier (area) utilization increases by a slightly lower multiple (better than linear).

**2.** As parallelism increases, timing closure gets harder and the achievable system clock slows, so throughput grows sublinearly with parallelism. On modern FPGAs, however, this degradation is diminishing.

**3.** Because of Nos. 1 and 2 above, overall throughput-per-area growth is better than linear.

**4.** Latency decreases as parallelism increases.
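A first-order model makes points 1 through 4 easy to quantify. The sketch below is illustrative only: the clock rate and cycle counts are hypothetical placeholders, not measurements from Table 1.

```python
def pfft_metrics(n_fft, parallelism, f_clk_mhz):
    """First-order PFFT scaling model (illustrative, not measured data).

    Throughput is parallelism x clock rate; a streaming pipeline needs on
    the order of N/P clocks to produce its first output, so latency shrinks
    roughly in proportion to the sample-rate-to-clock-rate ratio.
    """
    throughput_gsps = parallelism * f_clk_mhz / 1000.0  # Gsamples/s
    latency_us = (n_fft / parallelism) / f_clk_mhz      # pipeline-fill time
    return throughput_gsps, latency_us
```

For example, a hypothetical 1024-point PFFT with four lanes at a 400-MHz clock would sustain 1.6 Gsamples/s with a pipeline-fill latency of about 0.64 µs; doubling the lanes roughly doubles throughput and halves latency, before accounting for the clock-rate erosion noted in point 2.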

Note that the specific numbers measured in **Table 1** are valid only for a given target and configuration of the FFT. In this case, that is a length of 1024, with 16-bit input, dynamic length programmability (4 through 1024) and flow control. Flow control is very important for applications such as spectral monitoring, where side-channel information is often used to change the FFT size (to change the resolution bandwidth) or to temporarily stall the FFT while other operations, such as acquisition, are in progress. In theory, you can accomplish flow control by inserting buffers before the transform operation. But for acquisition-driven operations like spectral monitoring, it is not easy to precompute the required buffer size, which forces you to maintain large, fast and expensive memory banks.
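The buffer-sizing difficulty can be seen with a toy model (hypothetical, not from the design above): a FIFO ahead of the transform fills at the line rate whenever the FFT is stalled, so its worst-case depth depends on the stall pattern, which an acquisition-driven system cannot easily bound in advance.

```python
def max_fifo_depth(samples_per_clk, stall_pattern):
    """Worst-case occupancy of a FIFO feeding a stallable FFT.

    Input arrives every clock at the line rate; the FFT drains at the same
    rate except on clocks where `stall_pattern` is True. Toy model for
    illustration only.
    """
    depth = occ = 0
    for stalled in stall_pattern:
        occ += samples_per_clk                       # wideband input never pauses
        if not stalled:
            occ = max(occ - samples_per_clk, 0)      # FFT keeps pace when running
        depth = max(depth, occ)
    return depth
```

Because the FFT in this model never drains faster than the line rate, every stalled clock adds permanently to the backlog: four samples per clock stalled for two clocks leaves eight samples queued for good. Sizing the FIFO therefore requires knowing the worst-case cumulative stall time up front, which is exactly what acquisition-driven monitoring makes hard, and why built-in flow control in the PFFT core is preferable to external buffering.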