# Doing Math in FPGAs, Part 3 (Floating-Point)

Floating-point numbers are similar to the "scientific notation" we learned in high school, but they are stored and manipulated using binary representations.

For the purposes of this column we will focus on the IEEE 754-2008 floating-point standard (hereinafter referred to as "754"). 754 defines several binary formats, distinguished by their total width and by how that width is split between the exponent and the mantissa. These are *Half*, *Single*, *Double*, *Double Extended*, and *Quad Precision*. The binary representations of these are as follows (the extended format's width is implementation-defined; the familiar x87 version is shown):

| Format          | Total bits | Sign | Exponent | Mantissa (stored) |
|-----------------|------------|------|----------|-------------------|
| Half            | 16         | 1    | 5        | 10                |
| Single          | 32         | 1    | 8        | 23                |
| Double          | 64         | 1    | 11       | 52                |
| Double Extended | 80 (x87)   | 1    | 15       | 64                |
| Quad            | 128        | 1    | 15       | 112               |

754 also includes some special formatting for certain values, such as NaN (not a number), infinity, and some others. I'll leave it to you to research those. For clarity, I'll stick to the half-precision (16-bit) format in this article. Except for the range of possible values and biases (which I'll blather on about in a bit), things work the same for each type.

First, there's the sign bit. If our number is negative, then the sign bit will be a '1'; otherwise it will be a '0' (negative zero is possible, which keeps things like divide-by-zero "honest" by preserving the sign of the result). Easy, right?

Next is the exponent. Here there is a trick: the exponent field has no sign bit of its own, yet we (should) all know that exponents can be negative. The trick is that the stored exponent carries a *bias* which must be subtracted in order to find its true value. The bias can be computed as follows:

b = 2^(n-1) - 1

Or, equivalently:

b = (2^n)/2 - 1

Where:

- b = the bias
- n = the number of bits in the exponent

More simply, the biases are shown in the table below:

| Format | Exponent bits (n) | Bias (b) |
|--------|-------------------|----------|
| Half   | 5                 | 15       |
| Single | 8                 | 127      |
| Double | 11                | 1023     |
| Quad   | 15                | 16383    |

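The bias formula is easy to check in a few lines of code. Here's a quick sketch in Python (the function name is my own, not anything from 754):

```python
def exponent_bias(n: int) -> int:
    """IEEE 754 exponent bias for an n-bit exponent field: 2^(n-1) - 1."""
    return (1 << (n - 1)) - 1

for name, n in [("half", 5), ("single", 8), ("double", 11), ("quad", 15)]:
    print(f"{name}: {n}-bit exponent, bias = {exponent_bias(n)}")
```
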
This means that in our half-precision number, the stored exponent field can hold values from 0 to 31. After subtracting the bias of 15 -- and setting aside the all-zeros and all-ones patterns, which 754 reserves for subnormals, infinities, and NaNs -- the usable exponent range is [-14 to +15]. That's a lot of dynamic range! But this introduces one of the drawbacks of floating-point numbers, and that is binimal-point (the binary analog of decimal-point) alignment. So, imagine that we're adding two numbers with different exponents -- we first need to shift one of the numbers (or both) until their binimal points are aligned. We also need to keep track of the result of the addition and update the exponent if there was a carry out (overflow).
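
To make the alignment step concrete, here's a sketch in Python of what an adder datapath has to do, assuming positive operands given as (unbiased exponent, 11-bit mantissa with the implied '1' already restored) -- the function and its calling convention are my own, not part of 754:

```python
def add_aligned(e_a: int, m_a: int, e_b: int, m_b: int):
    """Add two positive half-precision-style values.
    Each operand is (unbiased exponent, 11-bit mantissa in 1024..2047).
    Returns (exponent, mantissa) of the normalized sum."""
    # Align: shift the smaller-exponent operand's mantissa right until
    # both binimal points line up.
    if e_a < e_b:
        m_a >>= (e_b - e_a)   # low bits fall off here: loss of precision
        e = e_b
    else:
        m_b >>= (e_a - e_b)
        e = e_a
    m = m_a + m_b
    # Renormalize: a carry out of the 11-bit field bumps the exponent.
    while m >= (1 << 11):
        m >>= 1
        e += 1
    return e, m
```

For example, 1.5 x 2^3 + 1.0 x 2^1 (i.e. 12 + 2) aligns the second mantissa down by two places and comes out as 1.75 x 2^3 = 14.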

The final part of our floating-point number is the mantissa. In our half-precision implementation there are 11 bits of information. "Wait," you might say, "there are actually only 10 bits!" This is true, but the trick is that the 11th bit (the most-significant bit) is implied. Basically, you keep shifting the mantissa (and adjusting your exponent accordingly) until exactly one '1' sits to the left of the binimal point, at which point you "throw that '1' away" -- for a normalized number it is always there, so storing it would waste a bit. Here's an example -- let's store the number 0.02 as follows:

0.02 = 1.28 x 2^-6 (shift left six places to normalize)

- Sign: 0 (positive)
- Exponent: -6 + 15 (bias) = 9 = 01001
- Mantissa: drop the implied '1' from 1.28 and keep the fraction: 0.28 x 1024 = 286.72, which truncates to 286 = 0100011110

Putting the fields together gives 0 01001 0100011110, which decodes back to (1 + 286/1024) x 2^-6 ≈ 0.019989 -- close to 0.02, but not exact.

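The 0.02 encoding can be sanity-checked against Python's built-in half-precision support (the `struct` format code `'e'`); `decode_half` below is my own throwaway helper, not a library function:

```python
import struct

def decode_half(bits: int) -> float:
    """Decode a 16-bit half-precision pattern (normal numbers only)."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exp  = ((bits >> 10) & 0x1F) - 15        # remove the bias
    mant = 1.0 + (bits & 0x3FF) / 1024.0     # restore the implied '1'
    return sign * mant * 2.0 ** exp

# Encode 0.02 as a half and pull out the raw bit pattern.
(bits,) = struct.unpack("<H", struct.pack("<e", 0.02))
print(hex(bits), decode_half(bits))
```

Note that `struct` rounds to nearest rather than truncating, so the last bit of the stored fraction may differ by one from a simple chopped encoding.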
So, that's how we store our number into the floating-point format. Even though the floating-point format has the advantages of high dynamic range in a fairly compact space, we can also see that there are some disadvantages as follows:

- Floating-point (like all binary representations) cannot represent most decimal fractions exactly (that is, the precision limits the translation).
- You may need to apply the bias prior to an operation (not necessary for multiplication or division, as the exponents add/subtract).
- To add two numbers, you must first "unroll" the exponents to align their binimal points.
- After a math operation, you must "reroll" the exponents.
- The act of "unrolling" the exponents can lead to a loss of precision if your registers are too narrow.

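The multiplication point in the list above -- exponents simply add, with no alignment step -- can be sketched the same way. As before, operands are (unbiased exponent, 11-bit mantissa with the implied '1' restored), and this is my own illustrative model, not a full 754 multiplier (no sign, rounding, or special-value handling):

```python
def mul_fp(e_a: int, m_a: int, e_b: int, m_b: int):
    """Multiply two positive half-precision-style values.
    Each operand is (unbiased exponent, 11-bit mantissa in 1024..2047)."""
    e = e_a + e_b          # exponents add -- no binimal alignment needed
    m = m_a * m_b          # 21- or 22-bit product of the two mantissas
    # Renormalize back down to 11 bits, truncating the low bits.
    if m >= (1 << 21):
        m >>= 11
        e += 1
    else:
        m >>= 10
    return e, m
```

For example, (1.5 x 2^1) x (1.5 x 2^1) = 2.25 x 2^2, which renormalizes to 1.125 x 2^3 = 9.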
There is also a disadvantage in that you may need to "intelligently" decide which number to unroll for a given operation -- that is, there needs to be a decision made about which value is *more significant* so you don't lose (or gain) significance during the unrolling operations. As an example, consider 3.24 + 0.02001; which of these should lose bits if it proves necessary to do so? The answer is 0.02001, as the result cannot be more "precise" than any of the inputs.

Another drawback that should be obvious at this point is *truncation*. 754 defines several different *rounding* modes, but I'm betting that in many cases we might not want to spend the extra hardware to do that. Unfortunately, truncation (unlike rounding) always "pushes" our results toward zero (smaller magnitude). This may not seem like a big deal for a single operation, but truncation errors tend to "stack up" over multiple operations and can badly skew results. This is a big enough deal that IBM used to add guard bits in their machinery back in the bad old days.
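
Here's a quick way to watch truncation error stack up. `trunc_half` is my own crude model that chops a value to 11 significant bits with no rounding -- not the 754 rounding machinery:

```python
import math

def trunc_half(x: float) -> float:
    """Chop x to 11 significant bits, truncating toward zero (no rounding)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)   # x = m * 2**e, with 0.5 <= |m| < 1
    return math.ldexp(math.trunc(m * (1 << 11)), e - 11)

# Accumulate 0.1 a thousand times, truncating after every step.
s = 0.0
for _ in range(1000):
    s = trunc_half(s + trunc_half(0.1))
print(s)   # lands well short of 100.0 -- every truncation pushed us down
```

Each truncation loses at most one unit in the last place, but because the loss is always in the same direction, a thousand of them drag the sum noticeably below the true answer.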

On the plus side, the floating-point format provides a relatively compact way to store data, and you can always "roll your own" format for customized needs. Having said this, when talking to other computing machinery, the IEEE standard is widely used, so sticking with this might make life easier for a lot of other people.

Of course, there are libraries available that claim to do all the hard stuff for you, but since I don't have any experience with those, I won't comment on the veracity of their claims.

Once again, as I stated in the last couple articles, it's *your* job to understand *your* needs and to choose the method that is best for *your* application. Next time, we'll take a look at fixed-point representations; in the meantime, please post any questions or comments below.
