# Tutorial: Floating-point arithmetic on FPGAs

*[Editor's note: For an intro to fixed-point math, see Fixed-Point DSP and Algorithm Implementation. For a comparison of fixed- and floating-point hardware, see Fixed vs. floating point: a surprisingly hard choice.]*

Inside microprocessors, numbers are represented as integers—one or several bytes strung together. A four-byte value comprising 32 bits can hold a relatively large range of numbers: 2^32, to be specific. The 32 bits can represent the numbers 0 to 4,294,967,295 or, alternatively, -2,147,483,648 to +2,147,483,647. A 32-bit processor is architected such that basic arithmetic operations on 32-bit integer numbers can be completed in just a few clock cycles, and with some performance overhead a 32-bit CPU can also support operations on 64-bit numbers. The largest value that can be represented by 64 bits is truly astronomical: 18,446,744,073,709,551,615. In fact, if a Pentium processor could count 64-bit values at a frequency of 2.4 GHz, it would take about 243 years to count from zero to the maximum 64-bit integer.

**Dynamic Range and Rounding Error Problems**

Considering this, you might think that integers work fine, but that is not always the case. The problems with integers are limited dynamic range and rounding errors.

The quantization introduced by a finite-resolution number format distorts the representation of the signal. However, as long as a signal uses most of the range of numbers that integers can represent—also known as the dynamic range—this distortion may be negligible.

*Figure 1* shows what a quantized signal looks like for large and small dynamic swings, respectively. Clearly, with the smaller amplitude, each quantization step is bigger relative to the signal swing and introduces higher distortion or inaccuracy.


Figure 1: Signal quantization and dynamic range

The following example illustrates how integer math can mess things up.

**A Calculation Gone Bad**

An electronic motor control measures the velocity of a spinning motor, which typically ranges from 0 to 10,000 RPM. The value is measured using a 32-bit counter. To allow some overflow margin, let's assume that the measurement is scaled so that 15,000 RPM corresponds to the full 32-bit scale, 2^32 = 4,294,967,296. If the motor is spinning at 105 RPM, the scaled measurement is the integer 30,064,771, accurate to within 0.0000033%, which you would think is accurate enough for most practical purposes.

Assume that the motor control is instructed to increase motor velocity by 0.15% of the current value. Because we are operating with integers, multiplying by 1.0015 is out of the question—as is multiplying by 10,015 and then dividing by 10,000—because the intermediate product overflows 32 bits.

The only remaining option is to first divide by integer 10,000 and then multiply by integer 10,015. If you do that, you end up with 30,105,090; but the correct answer is 30,109,868. Because of the truncation that happens when you divide by 10,000, the resulting velocity increase is 10.6% smaller than what you asked for. Now, an error of 10.6% of a 0.15% adjustment may not sound like anything to worry about, but as you continue to perform similar adjustments to the motor speed, these errors will almost certainly accumulate to a point where they become a problem.

What you need to overcome this problem is a numeric computer representation that represents small and large numbers with equal precision. That is exactly what floating-point arithmetic does.

**Floating Point to the Rescue**

As you have probably guessed, floating-point arithmetic is important in industrial applications like motor control, and in many other areas as well. An increasing number of applications that have traditionally used integer math are turning to floating-point representation. I'll discuss this once we have looked at how floating-point math is performed inside a computer.