FPGAs enable specific optimizations for floating-point
FPGAs have specific characteristics lacking in microprocessors, and these features can be leveraged to produce a more efficient floating-point flow. First, unlike microprocessors, FPGAs contain thousands of hardened multiplier circuits. These can be used both for mantissa multiplication and as shifters. Shifting is required to normalize the mantissa, setting the position of its binary point, and to denormalize mantissas as needed to align exponents. A conventional barrel-shifter structure would require very high fan-in multiplexers at each bit position, along with the routing to connect every possible bit input. This leads to very poor fitting, slow clock rates, and excessive logic usage, which has discouraged the use of floating-point operations in FPGAs in the past.
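The normalization step can be sketched in software. The following Python fragment is a hypothetical illustration, not the toolflow's implementation: it normalizes a fixed-width mantissa by counting leading zeros and adjusting the exponent. Note that a left shift by k is simply a multiplication by 2^k, which is why an FPGA's hard multiplier blocks can stand in for a barrel shifter.

```python
def normalize(mantissa, exponent, width=24):
    """Shift the mantissa left until its most significant bit is set,
    decrementing the exponent by the shift count.  A left shift by k
    equals multiplying by 2**k, so a hard multiplier can perform it."""
    if mantissa == 0:
        return 0, 0
    shift = 0
    while not (mantissa >> (width - 1)) & 1:
        mantissa <<= 1
        shift += 1
    return mantissa, exponent - shift

# 0b000101 with a 6-bit mantissa needs a shift of 3:
print(normalize(0b000101, 0, width=6))  # -> (40, -3), i.e. (0b101000, -3)
```

In hardware, the variable shift amount selects among precomputed powers of two fed to a multiplier, avoiding the wide multiplexer tree a logic-fabric barrel shifter would need.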
Second, an FPGA can use larger mantissas than the IEEE 754 representation. This is possible because the variable-precision DSP blocks support 27x27 and 36x36 multiplier sizes, which more than cover the 23-bit mantissa of a single-precision floating-point datapath. Because the remaining circuits are built from configurable logic, they can be made whatever mantissa size is desired. Carrying a few extra mantissa bits, such as 27 bits instead of 23, preserves extra precision from one operation to the next, significantly reducing the need for normalization and denormalization.
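The benefit of those extra mantissa bits can be illustrated in software. The sketch below models only the rounding effect, not the actual datapath: each intermediate sum is rounded to a given number of mantissa bits. At 23 bits a stream of small addends is rounded away entirely, while at 27 bits the extra precision preserves them.

```python
import math

def round_mantissa(x, bits):
    """Round x to `bits` significant mantissa bits, round-to-nearest-even."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (bits - 1 - e)
    return round(x * scale) / scale

def chained_sum(values, bits):
    """Accumulate, rounding the intermediate sum after every addition."""
    acc = 0.0
    for v in values:
        acc = round_mantissa(acc + v, bits)
    return acc

values = [1.0] + [2.0 ** -24] * 16   # addends below 23-bit precision
print(chained_sum(values, 23))       # -> 1.0 (every addend rounded away)
print(chained_sum(values, 27))       # -> 1.0000009536743164 (= 1 + 2**-20)
```

The 27-bit accumulator captures contributions that a strict 23-bit IEEE 754 datapath would discard at every step, which is the source of the precision advantage discussed below.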
The fused-datapath tool analyzes the need for normalization in the design and inserts these stages only where necessary. This analysis leads to a dramatic reduction in logic, routing, and multiplier-based shifting resources. It also results in a much higher fMAX, or achievable clock rate, even in very large floating-point designs, as shown graphically in Figure 1.
Figure 1. Fused datapath optimizations
Because an IEEE 754 representation is still necessary for interoperability with the rest of the floating-point world, all of the floating-point functions support this interface at their boundaries, whether the function is a fast Fourier transform (FFT), a matrix inversion, a sine function, or a custom datapath specified by the customer. Two questions remain: whether the fused-datapath toolflow provides the same results as the IEEE 754 approach used by microprocessors, and how verification is performed. Even microprocessors produce different floating-point results, depending on how they are implemented.
The main reason for these differences is that floating-point operations are not associative, which is easily demonstrated by writing a short C or MATLAB program that sums a set of floating-point numbers: summing the same numbers in the opposite order produces results that differ in several least significant bits (LSBs). To verify the fused-datapath method, the designer must therefore discard the bit-by-bit matching of results typically used in fixed-point data processing. Instead, the tools allow the designer to declare a tolerance and to compare the hardware results of the fused-datapath toolflow against the simulation model results.
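Non-associativity is easy to reproduce in any IEEE 754 environment. The Python snippet below uses double precision, but the effect is identical in single precision: the same three numbers are summed in two orders, and a tolerance-based comparison of the kind the tools use accepts both results, while a bit-exact comparison would not.

```python
import math

vals = [1.0, 2.0 ** -53, 2.0 ** -53]

# Left-to-right: each tiny addend falls on the round-to-even boundary
# of 1.0 and is rounded away.
forward = (vals[0] + vals[1]) + vals[2]

# Right-to-left: the tiny addends combine to 2**-52, one ulp of 1.0,
# which survives the final addition.
reverse = (vals[2] + vals[1]) + vals[0]

print(forward == reverse)                            # -> False
print(forward, reverse)                              # 1.0 vs 1.0000000000000002
print(math.isclose(forward, reverse, rel_tol=1e-9))  # -> True
```

A bit-exact check fails even though both orderings are correctly rounded IEEE 754 computations; a declared tolerance accepts both, which is exactly the comparison strategy the verification flow adopts.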
A large single-precision floating-point matrix inversion function can be implemented using the fused-datapath toolflow and tested across input matrices of different sizes. The same results can also be computed on an IEEE 754-based Pentium processor. The reference result is computed on the processor using double-precision floating-point operations, which is effectively exact relative to any single-precision architecture. Comparing both the IEEE 754 single-precision results and the fused-datapath single-precision results against this reference, and computing the Frobenius norm of the differences, shows that the fused-datapath toolflow gives more precise results than the IEEE 754 approach, due to the extra mantissa precision carried in the intermediate calculations.
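The measurement procedure can be mimicked in software, with NumPy standing in for both architectures; this reproduces only the comparison methodology, not the fused-datapath hardware results. The same matrix is inverted in single and double precision, and the Frobenius norm of the difference quantifies the single-precision error.

```python
import numpy as np

rng = np.random.default_rng(0)
# A well-conditioned test matrix (identity plus a small perturbation)
A64 = np.eye(64) + 0.01 * rng.standard_normal((64, 64))
A32 = A64.astype(np.float32)

inv64 = np.linalg.inv(A64)                     # double-precision reference
inv32 = np.linalg.inv(A32).astype(np.float64)  # single-precision result

# Frobenius norm of the single- vs. double-precision difference
err = np.linalg.norm(inv64 - inv32, ord='fro')
rel = err / np.linalg.norm(inv64, ord='fro')
print(rel)  # small but nonzero: single precision differs in the LSBs
```

In the whitepaper's experiment, this norm is computed twice, once for the IEEE 754 single-precision datapath and once for the fused-datapath result, and the fused-datapath error is the smaller of the two.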
Table 1 lists the mean, standard deviation, and Frobenius norm of the errors, where the SD subscript refers to the IEEE 754-based single-precision architecture compared with the reference double-precision architecture, and the HD subscript refers to the hardware-based fused-datapath single-precision architecture compared with the same reference.
Table 1. Fused datapath precision results