Based on recent technological developments, high-performance floating-point signal processing can, for the first time, be easily achieved using FPGAs. To date, virtually all FPGA-based signal processing has been implemented with fixed-point arithmetic. This article describes how floating-point technology in FPGAs is not only practical today, but also how processing rates of one trillion floating-point operations per second (teraFLOPS) are feasible and can be implemented on a single FPGA die.
Recently announced 28-nm FPGAs can enable much higher levels of both fixed- and floating-point digital signal processing (DSP) than ever before. A key aspect of this is the new and innovative variable-precision DSP architecture that efficiently supports both fixed- and floating-point implementations.
FPGA resources and architecture are not, by themselves, sufficient to build floating-point designs. Verilog and VHDL offer poor to essentially nonexistent support for floating-point representation, and no synthesis tools available today support floating point. Moreover, the traditional approach used in floating-point processors does not map well to FPGAs. Therefore, a new “fused-datapath” toolflow has been designed specifically to build floating-point datapaths while accounting for the hardware implementation issues inherent in FPGAs. This design tool allows designers, for the first time, to create high-performance floating-point implementations of large FPGA designs.
By combining the capabilities of FPGAs with the fused-datapath toolflow, a 1-teraFLOPS processing rate can be easily supported. This toolflow has been used for several years to build floating-point IP and reference designs, and its results are easily replicated by customers. The floating-point performance offered by today’s FPGAs, and how this IP is being used by customers today, are summarized below.
To maximize fixed- and floating-point performance, a new variable-precision DSP architecture has been developed for 28-nm FPGAs. With this architecture, designers have the option to “dial” the DSP block to the required precision. The variable-precision architecture efficiently supports existing 18x18- and 18x25-bit fixed-point applications, while also offering the higher precision required for floating-point applications. The 27x27- and 54x54-bit modes in particular are designed to support single- and double-precision floating-point applications. The efficiency of this new variable-precision DSP block is critical to supporting 1-teraFLOPS performance on a single FPGA.
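A short sketch of why the 27x27-bit mode matches single precision (the 27-bit width is taken from the article; the arithmetic below is standard IEEE 754): a single-precision significand is 24 bits (23 stored fraction bits plus the implicit leading one), so each operand fits in a 27-bit multiplier input and the full 48-bit product fits in the multiplier output.

```python
# Why a 27x27 multiplier mode covers single-precision significand multiplication.
# (Bit-width arithmetic only; not vendor implementation detail.)

SP_SIGNIFICAND_BITS = 24   # 23 stored fraction bits + implicit leading one
DSP_MODE_BITS = 27         # one input of the 27x27 variable-precision mode

assert SP_SIGNIFICAND_BITS <= DSP_MODE_BITS  # operand fits

# Worst-case product of two maximal 24-bit significands:
max_sig = (1 << SP_SIGNIFICAND_BITS) - 1
product_bits = (max_sig * max_sig).bit_length()
print(product_bits)  # 48, which fits in the 54-bit (2 x 27) product width
```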
New levels of DSP resources
Floating-point processing rates are limited by multiplier resources. The density and architecture of the latest 28-nm FPGAs are optimized for floating-point applications and, in conjunction with the fused-datapath toolflow, offer the highest single-precision multiplier density per FPGA die.
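A back-of-envelope calculation shows how multiplier count translates into FLOPS. The figures below are purely illustrative assumptions, not datasheet values for any particular device:

```python
# Illustrative FLOPS estimate (hypothetical figures, not from a datasheet):
# sustained rate ~= multipliers x clock x ops-per-multiplier-per-cycle.

multipliers = 1000        # assumed single-precision multipliers usable per die
clock_hz = 500e6          # assumed DSP-block clock rate
ops_per_cycle = 2         # each multiply-add pair counts as 2 FLOPs

flops = multipliers * clock_hz * ops_per_cycle
print(flops / 1e12)  # 1.0 teraFLOPS under these assumptions
```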
For a conventional floating-point implementation in a microprocessor, the input and output data structure for each floating-point instruction conforms to the IEEE 754-2008 Standard for Floating-Point Arithmetic. This representation of floating-point numbers is very inefficient to implement within an FPGA: it does not use the “two’s complement” representation that is well suited to digital hardware implementation. Instead, the sign bit is stored separately, and an implicit “one” must be restored to each mantissa value.
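To make the representation concrete, the following sketch unpacks an IEEE 754-2008 single-precision value into the fields the article describes: a separated sign bit, a biased exponent, and a mantissa whose implicit leading one the hardware must restore.

```python
import struct

def decode_single(x: float):
    """Unpack an IEEE 754-2008 single-precision value into its fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                  # sign bit, stored separately
    exponent = (bits >> 23) & 0xFF     # 8-bit exponent, biased by 127
    fraction = bits & 0x7FFFFF         # 23 stored mantissa bits
    # Normal numbers carry an implicit leading 1 that hardware must restore:
    significand = ((1 << 23) | fraction) if exponent != 0 else fraction
    return sign, exponent, significand

print(decode_single(-1.5))  # (1, 127, 12582912), i.e. significand 0xC00000
```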
Specially designed circuitry is necessary to accommodate this representation, which is why microprocessors and DSP blocks typically are optimized for either fixed- or floating-point operations, but usually not both. Further, in a microprocessor, there is no knowledge of the floating-point operations before or after the current instruction, so no cross-instruction optimization can be performed. This means the circuit implementation must assume that logic-intensive normalization and denormalization must be performed on the data input and output of every instruction. Because of the inefficiency resulting from these issues, virtually all FPGA-based designs today use fixed-point arithmetic, even when the algorithm being implemented would work better with the high dynamic range of floating point.
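The fused-datapath idea can be illustrated in software: instead of normalizing and rounding after every operation, align the operands once, accumulate in a wide fixed-point register with guard bits, and normalize a single time at the end. This is a conceptual sketch only, not Altera’s actual datapath; the guard-bit width and helper function are invented for illustration.

```python
import math

def fused_sum(values):
    """Conceptual fused-datapath sketch: align all operands to a common
    exponent, accumulate in a wide fixed-point accumulator, and perform
    the costly normalization step only once at the end, rather than
    after every individual addition as a microprocessor FPU must."""
    # Decompose each value as mantissa * 2**exp (mantissa in [0.5, 1)).
    parts = [math.frexp(v) for v in values]
    min_exp = min(e for _, e in parts)
    GUARD = 60  # wide accumulator with ample guard bits: no per-op rounding
    acc = 0
    for m, e in parts:
        acc += int(m * (1 << GUARD)) << (e - min_exp)
    # Single normalization/rounding step at the end.
    return math.ldexp(acc, min_exp - GUARD)

print(fused_sum([1.5, 2.25, 0.125]))  # 3.875
```

In an FPGA, removing the per-operation normalization logic is what frees enough resources to chain many floating-point operators together efficiently.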