Design Article
How to implement double-precision floating-point on FPGAs
Danny Kreindler, Altera Corporation
10/3/2007 2:19 PM EDT
Introduction
An increasing number of applications in many vertical market segments, from financial analytics to military radar to various imaging applications, rely on computations with floating-point (FP) numbers. These applications implement various basic functions and methods such as fast Fourier transforms (FFTs), finite impulse response (FIR) filters, synthetic aperture radar (SAR), matrix math, and Monte Carlo methods. Many of these implementations use single-precision FP, where FPGAs can provide up to ten times the sustained performance of traditional CPUs. Recently, there has been increasing interest in double-precision performance to see how well FPGAs can compete with CPUs, especially for designs that have power and cooling constraints.
In a recent article titled "FPGA Floating-Point Performance – A Paper and Pencil Evaluation," Dave Strenski discusses how to estimate the double-precision (64-bit) peak FP performance of an FPGA. This article evaluates his method and, more importantly, expands on it with "real-world" considerations for estimating the sustained FP performance of an FPGA. These considerations are validated using a matrix multiplication design running in an Altera Stratix II FPGA.
The double-precision general matrix multiply (DGEMM) routine is referenced here. DGEMM is a common building block for many algorithms and is the most important component of the scientific LINPACK benchmark commonly used on CPUs. The Basic Linear Algebra Subprograms (BLAS) include DGEMM in the Level 3 group. The DGEMM routine calculates the new value of matrix C based on the product of matrix A and matrix B and the previous value of matrix C using the formula C = αAB + βC (where α and β are scalar coefficients).
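To make the formula concrete, here is a minimal pure-Python sketch of the update C = αAB + βC. The function name dgemm and its argument order are chosen here for illustration only; this is not the BLAS calling convention, and it says nothing about how the FPGA design schedules the arithmetic.

```python
def dgemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for matrices stored as lists of rows."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    result = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # Dot product of row i of A with column j of B.
            acc = 0.0
            for k in range(n_inner):
                acc += a[i][k] * b[k][j]
            # Scale the product and add the scaled previous value of C.
            row.append(alpha * acc + beta * c[i][j])
        result.append(row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(dgemm(1.0, a, b, 1.0, c))  # [[20.0, 23.0], [44.0, 51.0]]
```

With α = β = 1 the scaling steps drop out, leaving only the multiply-accumulate work of the inner loop plus one final addition per element.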
For this analysis, α = β = 1 is used, though any scalar values can be used because they can be applied as the data is transferred in and out. As can be seen, this operation results in a 1:1 ratio of adders and multipliers. This analysis also takes into account the logic required for a microprocessor interface protocol core and adds the following considerations:
- Memory interface module for low latency access to local data
- Data paths from memory interface to FPGA memory
- Data path from FPGA memory to FP cores
- Decrease to FP core FMAX when the FPGA is full
- Unusable FPGA logic due to routing challenges of a full FPGA
The FPGA benchmark focuses on the performance of an implementation of the AB matrix multiplication with data from a locally attached SRAM. Extending this core to include the accumulator that adds the old value of C is a relatively minor effort.
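The 1:1 ratio of adders and multipliers noted earlier can be checked by counting operations. The sketch below assumes square n × n matrices and α = β = 1; the function count_ops is a hypothetical helper written for this illustration and is not part of the benchmark design.

```python
def count_ops(n, accumulate_c):
    """Count FP multiplies and adds for an n x n matrix product.

    If accumulate_c is True, the old value of C is added to each result.
    """
    multiplies = 0
    adds = 0
    for _ in range(n * n):       # one dot product per element of the result
        multiplies += n          # n products per dot product
        adds += n - 1            # n - 1 additions to sum those products
        if accumulate_c:
            adds += 1            # one more addition for the old value of C
    return multiplies, adds


print(count_ops(4, accumulate_c=False))  # (64, 48)
print(count_ops(4, accumulate_c=True))   # (64, 64)
```

The AB product alone needs n − 1 additions for every n multiplications, so adding the old value of C brings the count to exactly one addition per multiplication.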
Peak performance calculations
The scenario for peak performance uses the same approach used in the aforementioned article. Table 1 shows the resources that are available on the EP2S180 FPGA:

Table 1. EP2S180 resources.
Each adaptive logic module (ALM) contains two 6-input functions or adaptive look-up tables (ALUTs), with four inputs and two select signals that allow more efficient use of the logic. Thus, this FPGA has 71,760 × 2 = 143,520 ALUTs. In addition, the device has 384 18-bit × 18-bit hardware multiplier/accumulators. The FP adder function uses only ALUTs from the FPGA fabric, while the FP multiplier uses dedicated multipliers as well as ALUTs. Reserving 22,000 ALUTs for a processor-interface protocol core leaves 121,520 ALUTs for function units.
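The resource budget above is simple arithmetic, sketched below in Python. The constants come from the text. The helper max_adder_multiplier_pairs is a hypothetical illustration of how such a budget could be turned into a count of adder/multiplier pairs; its per-core costs are left as parameters to be read from Table 2 and are not filled in here.

```python
ALMS = 71_760              # adaptive logic modules on the EP2S180
ALUTS_PER_ALM = 2
INTERFACE_ALUTS = 22_000   # reserved for the processor-interface protocol core
MULTIPLIERS_18X18 = 384    # dedicated 18-bit x 18-bit multipliers

total_aluts = ALMS * ALUTS_PER_ALM
available_aluts = total_aluts - INTERFACE_ALUTS
print(total_aluts, available_aluts)  # 143520 121520


def max_adder_multiplier_pairs(adder_aluts, mult_aluts, mult_dsp,
                               aluts=available_aluts,
                               dsp=MULTIPLIERS_18X18):
    """Upper bound on 1:1 adder/multiplier pairs that fit the budget.

    Assumption: each pair costs adder_aluts + mult_aluts ALUTs and
    mult_dsp dedicated multipliers; routing and FMAX effects are ignored.
    """
    by_logic = aluts // (adder_aluts + mult_aluts)
    by_dsp = dsp // mult_dsp
    return min(by_logic, by_dsp)
```

Whichever resource runs out first, logic or dedicated multipliers, sets the bound on the number of pairs.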
Table 2 summarizes the resource utilization for the FP cores that are planned for release.

Table 2. Stratix II FP core performance/resource utilization.



