# A more efficient FFT for communications applications using a massively parallel processor array

The introduction of complex communications protocols based on Orthogonal Frequency Division Multiplexing (OFDM) has brought with it a growing demand for high-speed Fast Fourier Transform (FFT) processing. OFDM is used in the higher speed versions of the IEEE802.11 standard and in both fixed and mobile WiMAX (802.16d and 802.16e). It also forms the basis for the ADSL standard, where it is known as Discrete MultiTone (DMT), and has been proposed as a means of increasing data rates in enhanced versions of 3G and 4G cellular communications systems. This article describes how the power of a massively parallel processor array architecture can be harnessed to improve the efficiency of FFT processing in these complex applications, and presents some benchmark data on running complex FFTs on this architecture. The use of parallel processing also brings additional benefits when multiple antennas are used, since this multiplies the total number of FFTs and inverse FFTs that must be performed by the same factor.

All these systems demand high-speed FFT processing, but they also require flexibility. Frequently, market pressure can drive vendors to release products that comply with early versions of a standard, but which must have the flexibility to be upgraded to the final version using only a simple software update. For example, during the development of 802.16e, it shifted from a constant FFT size (256 for OFDM, 2048 for OFDMA) to a scalable PHY, with the FFT size shifting for different channel bandwidths – with a maximum of either 1024 or 2048 being argued. This demanded a solution that could be implemented on programmable silicon. Although numerous algorithms and architectures exist for implementing the FFT, the use of a software-programmable platform often mandates the use of an algorithm that is optimised for software processing, even though it may be slower or less power-efficient than a hardware-oriented design. Changing the FFT size for a traditional, block-based implementation also poses significant issues to the overall system performance due to scheduling considerations.

However, it is possible to adopt a programmable platform that allows the efficient implementation of hardware-oriented algorithms on a software-based engine. An example is the high-performance PC102 from picoChip that combines the time-to-market and abstraction benefits of a software development environment with the performance benefits gained by exploiting parallelisms within an algorithm.

The FFT is simply an efficient implementation of the Discrete Fourier Transform (DFT). For a N-point DFT, a direct implementation requires of the order of N^2 complex multiply-and-add operations. But as a perfect example of how a clever algorithm can deliver incredible efficiency gains, the classic FFT only requires of the order of N × log(2)N operations. An (unscaled) definition of the DFT, which acts as the starting point for the FFT algorithm, is presented in *Fig 1*.

*1. FFT with (a) decimation in frequency, (b) decimation in time.*

Two approaches exist for reducing the DFT into a series of simpler calculations. One is to perform decimation in frequency and the other is to perform decimation in time. Both approaches require the same number of complex multiplications and additions. The key difference between the two is that decimation in time takes digit-reversed inputs and generates normal-order outputs, whereas decimation in frequency takes normal-order inputs and generates digit-reversed outputs (depending on the implementation of the FFT algorithm). The manipulation of inputs and outputs is carried out by so-called butterfly stages. The use of each butterfly stage involves multiplying an input by a complex twiddle factor:

e^{-j2(Pi)n/N}

A pipeline FFT is characterised by the real-time, non-stop processing of a sequential input stream. A hardware-orientated approach strives to minimise the cost or area of silicon by minimising the number of complex multipliers needed. That allows more elements to be computed in parallel for a given area.

The FFT algorithm involves the temporal separation of data, a task that is performed by the butterfly stage. Because samples will be taken from different points in the input stream, pipeline FFTs need a way of buffering and reordering data. Various architectures exist for this. The example FFT is based on the standard radix-4 decimation in frequency algorithm.

**Array elements**

The PC102 consists of an array of 300 individual processors or array elements (AEs), each a 16-bit Harvard architecture, with local memory, interconnected by a sophisticated high-speed switching fabric called the picoBus. There are four different types of array elements (AEs), which are detailed in *Table 1*. Minor differences exist between the three programmable AE types (STAN2, MEM2 and CTRL2). These differences include the size of instruction/data memory, additional processing unit and instructions supported (e.g. multiply-accumulate, multiply). A long instruction word (LIW) of up to 64 bits allows up to three execution units to be targeted in a single cycle at 160MHz. Each AE has a number of ports for communicating with other AEs within the array using the picoBus.

*Table 1. PC102 processor variants and memory distribution.*

The software is written in C or assembler, and is targeted at a particular AE type depending on the processing units used and memory required. The C or ASM code for each AE is contained within a picoVHDL wrapper, which defines the ports and the type of AE used.

In addition to the STAN2, MEM2 and CTRL2 AE types specified in *Table 1*, software for the PC102 can also be targeted at the ANY2 AE type implying that:

- The function does not use any AE-specific instructions.
- The code and data memory requirements can be met by all AE types.

**picoBus**

As in the case of a FPGA, the picoBus switching fabric is allocated at compile time, so that all communications and processing is predictable and deterministic. This differs from a conventional DSP, with RTOPS, scheduling and arbitration, which are only statistically predictable. Unlike a FPGA, however, the buses are all 32 bits wide, and have a TDM-cycle, so a number of independent signals can share the same path. Bus slots are allocated every 2^x cycles (where 0 < x < 11) and are defined in software. For example, a bus slot allocated a rate of 2^4 cycles i.e. every 16 cycles, is said to have an "at rate" of @16. In this way, any processor can talk to any other, and an extremely complex web of communications can be created in a very efficient way. By default, communication between array elements is data blocking. This architecture provides a communications structure that will be familiar to hardware designers as the communications mechanisms, from a logic-layer standpoint, looks like a set of point-to-point links each operating at a set fraction of the total picoBus bandwidth, with FIFOs used to buffer data.

In many respects, the picoArray combines the best of both worlds of a DSP and a FPGA: it has the familiar easy programming model of a processor, yet delivers the performance, scalability and predictable performance of a hardware implementation.