# Understanding Peak Floating-Point Performance Calculations

FPGAs with hardened floating-point DSP blocks are now available, providing up to 10,000 GFLOPS in high-end devices.

DSPs, GPUs, and FPGAs serve as accelerators for many CPUs, providing both performance and power-efficiency benefits. Given the variety of computing architectures available, designers need a uniform method to compare performance and power efficiency. The accepted method is to measure floating-point operations per second (FLOPS), where a FLOP is defined as either an addition or a multiplication of single-precision (32-bit) or double-precision (64-bit) numbers in conformance with the IEEE 754 standard. All higher-order functions, such as division, square root, and trigonometric operators, can be constructed from adders and multipliers. Because these operators, as well as other common functions such as fast Fourier transforms (FFTs) and matrix operations, require both adders and multipliers, there is commonly a 1:1 ratio of adders to multipliers in all these architectures.

Let's look at how to compare the performance of the DSP, GPU, and FPGA architectures based on their peak FLOPS ratings. The peak FLOPS rating is determined by multiplying the sum of the adders and multipliers by the maximum operating frequency. This represents a theoretical limit that can never be achieved in practice, since it is generally not possible to implement useful algorithms that keep all the computational units occupied all the time. It does, however, provide a useful comparison metric.
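The calculation above can be captured in a few lines of Python (the function name here is just for illustration):

```python
def peak_gflops(adders, multipliers, freq_ghz):
    """Theoretical peak in GFLOPS: every adder and multiplier
    retires one FLOP per clock cycle at the maximum frequency."""
    return (adders + multipliers) * freq_ghz
```

The sections that follow apply this same formula to a DSP, a GPU, and an FPGA.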

First, we consider DSP GFLOPS performance, using Texas Instruments' TMS320C667x DSP as an example. This device contains eight DSP cores, with each core containing two processing subsystems. Each subsystem contains four single-precision floating-point adders and four single-precision floating-point multipliers, for a total of 64 adders and 64 multipliers. The fastest version available runs at 1.25 GHz, providing a peak of 160 GigaFLOPS (GFLOPS).
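Checking the arithmetic with the figures quoted above:

```python
# TMS320C667x: 8 cores x 2 subsystems x (4 adders + 4 multipliers) per subsystem
cores = 8
subsystems_per_core = 2
adders = cores * subsystems_per_core * 4       # 64 adders
multipliers = cores * subsystems_per_core * 4  # 64 multipliers
freq_ghz = 1.25

peak = (adders + multipliers) * freq_ghz
print(peak)  # 160.0 GFLOPS
```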

GPUs have become very popular devices, particularly for image-processing applications. One of the most powerful GPUs is the Nvidia Tesla K20. This GPU is built from CUDA cores, each containing a single floating-point fused multiply-add unit that can execute one operation per clock cycle in single-precision configuration. There are 192 CUDA cores in each Streaming Multiprocessor (SMX) processing engine. The K20 actually contains 15 SMX engines, although only 13 are enabled (due to process yield issues, for example). This gives a total of 2,496 available CUDA cores, each delivering two FLOPs per clock cycle, running at a maximum of 706 MHz. This provides a peak single-precision floating-point performance of 3,520 GFLOPS.
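The same tally for the K20, using the figures quoted above (a fused multiply-add counts as two FLOPs):

```python
# Tesla K20: 13 enabled SMX engines x 192 CUDA cores, FMA = 2 FLOPs/cycle
smx_enabled = 13
cores_per_smx = 192
cuda_cores = smx_enabled * cores_per_smx  # 2,496 cores
flops_per_core_per_cycle = 2              # one fused multiply-add
freq_ghz = 0.706

peak = cuda_cores * flops_per_core_per_cycle * freq_ghz
print(round(peak))  # 3524, commonly rounded down to the quoted 3,520 GFLOPS
```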

FPGA vendors such as Altera now offer hardened floating-point engines in their FPGAs. A single-precision floating-point multiplier and adder have been incorporated into the hard DSP blocks embedded throughout the programmable logic structures. A medium-sized device in Altera's midrange Arria 10 FPGA family is the 10AX066. This device has 1,688 DSP blocks, each of which can perform two FLOPs per clock cycle, resulting in 3,376 FLOPs each clock cycle. At a rated speed of 450 MHz (for floating point; the fixed-point modes run faster), this yields about 1,520 GFLOPS. Computed in a similar fashion, Altera states that 10,000 GFLOPS, or 10 TeraFLOPS, of single-precision performance will be available in the high-end Stratix 10 FPGAs, achieved through a combination of clock-rate increases and larger devices with much more DSP computing resources.
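Running the Arria 10 numbers through the same formula (note the result lands just under the quoted rounded figure):

```python
# Arria 10 10AX066: each hard DSP block holds one adder and one multiplier,
# i.e. 2 FLOPs per block per clock cycle
dsp_blocks = 1688
flops_per_block_per_cycle = 2
freq_ghz = 0.450  # floating-point rated speed

peak = dsp_blocks * flops_per_block_per_cycle * freq_ghz
print(peak)  # ~1519.2, quoted as roughly 1,520 GFLOPS
```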

Floating-point routines have always been available in FPGAs through the devices' programmable logic. Furthermore, with programmable logic-based floating point, an arbitrary precision level can be implemented, not restricted to the industry-standard single and double precision; Altera, for example, offers seven different levels of floating-point precision. However, determining the peak floating-point performance of a given FPGA using a programmable-logic implementation is not at all straightforward.

Therefore, the peak floating-point rating of any Altera FPGA is based solely on the capability of the hardened floating-point engines, and assumes that the programmable logic is not used for floating point, but rather for the other parts of a design, such as the data control and scheduling circuits, I/O interfaces, internal and external memory interfaces, and other required functionality.