# How to achieve 1 trillion floating-point operations-per-second in an FPGA

Thanks to recent technological developments, high-performance floating-point signal processing can, for the first time, be readily achieved using FPGAs. To date, virtually all FPGA-based signal processing has been implemented using fixed-point operations. This article describes how floating-point technology in FPGAs is not only practical today, but also how processing rates of one trillion floating-point operations per second (1 teraFLOPS) are feasible on a single FPGA die.

**What’s changed?**

Recently announced 28-nm FPGAs can enable much higher levels of both fixed- and floating-point digital signal processing (DSP) than ever before. A key aspect of this is the new and innovative variable-precision DSP architecture that efficiently supports both fixed- and floating-point implementations.

FPGA resources and architecture are not by themselves sufficient to build floating-point designs. Verilog and VHDL have poor to non-existent support for floating-point representation, and no synthesis tools available today support floating point. Moreover, the traditional approach used in floating-point processors does not work well in FPGAs. Therefore, a new “fused-datapath” toolflow has been designed specifically to build floating-point datapaths while taking into account the hardware implementation issues inherent in FPGAs. This design tool allows designers, for the first time, to create high-performance floating-point implementations of large FPGA designs.

Combining the capabilities of 28-nm FPGAs with the fused-datapath toolflow makes a 1-teraFLOPS processing rate achievable. This toolflow has been used for several years to build floating-point IP and reference designs and is easily replicated by customers. The floating-point performance offered by today’s FPGAs, and the ways customers are using this IP, are summarized below.

To maximize fixed- and floating-point performance, this new variable-precision DSP architecture has been developed for 28-nm FPGAs. With this architecture, designers can “dial” the DSP block to the required precision. The variable-precision architecture efficiently supports existing 18x18- and 18x25-bit fixed-point applications, and also offers the higher precision required for floating-point applications. The 27x27- and 54x54-bit modes in particular are designed to support single- and double-precision floating-point applications, respectively. The efficiency of this variable-precision DSP block is critical to supporting 1-teraFLOPS performance on a single FPGA.
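The link between the 27x27- and 54x54-bit modes and the two floating-point precisions can be checked with simple arithmetic. The sketch below is illustrative only: the significand widths (24 and 53 bits, counting the implicit leading one) come from IEEE 754, while the dictionary and function names are hypothetical, and the article does not state how the spare multiplier bits are used.

```python
import sys

# Significand widths in bits, counting the implicit leading one (IEEE 754).
SIGNIFICAND_BITS = {"single": 24, "double": 53}

# Multiplier modes named in the text; pairing each mode with a precision
# follows the article, but this table itself is only illustrative.
MULTIPLIER_BITS = {"single": 27, "double": 54}

def spare_bits(precision):
    """Multiplier width minus significand width for the given precision."""
    return MULTIPLIER_BITS[precision] - SIGNIFICAND_BITS[precision]

def product_bits(precision):
    """Bits needed to hold the full product of two significands."""
    return 2 * SIGNIFICAND_BITS[precision]

# Python floats are IEEE 754 doubles on common platforms, so this should hold.
assert sys.float_info.mant_dig == SIGNIFICAND_BITS["double"]

for p in ("single", "double"):
    print(p, SIGNIFICAND_BITS[p], MULTIPLIER_BITS[p], spare_bits(p), product_bits(p))
```

In both cases the multiplier is at least as wide as the significand it has to multiply, which is consistent with the modes being sized for single and double precision.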

**New levels of DSP resources**

Floating-point processing rates are limited by multiplier resources. The latest 28-nm FPGAs are optimized in both density and architecture for floating-point applications and, used in conjunction with the fused-datapath toolflow, offer the highest single-precision multiplier density per FPGA die.

In a conventional floating-point implementation in a microprocessor, the input and output data structure for each floating-point instruction conforms to the IEEE 754-2008 Standard for Floating-Point Arithmetic. This representation of floating-point numbers is very inefficient to implement within an FPGA because it does not use the “twos complement” representation, which is well suited to digital hardware implementation. Instead, the sign bit is kept separate, and there is an implicit “one” that must be added to each mantissa value.
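To make the separated sign bit and the implicit “one” concrete, here is a minimal sketch that unpacks a single-precision value into its IEEE 754 fields. It uses only Python's standard `struct` module; the function name `decode_float32` is hypothetical, and the code illustrates the number format only, not any FPGA toolflow.

```python
import struct

def decode_float32(x):
    """Return (sign, biased_exponent, fraction, significand) for x as float32."""
    # Reinterpret the 32-bit float as an unsigned integer to reach its bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                      # 1 bit, stored separately
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits, bias of 127
    fraction = bits & 0x7FFFFF             # 23 stored mantissa bits
    if 0 < biased_exponent < 0xFF:
        # Normal number: restore the implicit leading one (24 bits total).
        significand = (1 << 23) | fraction
    else:
        # Zero, subnormal, infinity or NaN: no implicit one.
        significand = fraction
    return sign, biased_exponent, fraction, significand

# -6.5 is -1.625 * 2**2: sign 1, exponent 2 + 127, fraction 0.625 * 2**23.
print(decode_float32(-6.5))  # (1, 129, 5242880, 13631488)
```

Hardware that multiplies or adds such values has to handle the sign separately and reinsert the leading one before it can use ordinary unsigned or twos-complement arithmetic on the mantissa.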

Specially designed circuitry is necessary to accommodate this representation, which is why microprocessors or DSP blocks typically are optimized for either fixed- or floating-point operations, but usually not both. Further, a microprocessor has no knowledge of the floating-point operations that come before or after the current instruction, so no optimization across instructions can be performed. This means the circuit implementation must assume that logic-intensive normalization and denormalization are performed on the data inputs and outputs of every instruction. Because of the inefficiency resulting from these issues, virtually all FPGA-based designs today are implemented using fixed-point operations, even when the algorithm being implemented would work better with the high dynamic range of floating-point operations.
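As a rough illustration of what normalization involves, the sketch below shifts an unsigned significand until its leading one sits in the top bit position and adjusts the exponent to compensate. It is a software model under simplifying assumptions (truncation instead of rounding, no special values), and the function name `normalize` and its parameters are hypothetical rather than part of any vendor tool.

```python
def normalize(significand, exponent, width=24):
    """Normalize significand * 2**exponent to a `width`-bit significand.

    Returns (significand, exponent) with the leading one at bit width - 1.
    """
    if significand == 0:
        return 0, 0  # zero has no leading one to align
    # Distance between the current leading one and the target position.
    shift = width - significand.bit_length()
    if shift >= 0:
        significand <<= shift
    else:
        # Too wide: drop low bits (real hardware would round here).
        significand >>= -shift
    # Compensate so the represented value stays the same (up to truncation).
    return significand, exponent - shift

print(normalize(1, 0))          # (8388608, -23)
print(normalize(0x1800000, 0))  # (12582912, 1)
```

In hardware the leading-one search and the variable shift map to a priority encoder and a barrel shifter, which is the kind of logic the text describes as expensive when it is repeated for every instruction.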