# Doing Math in FPGAs, Part 3 (Floating-Point)

Floating-point numbers are similar to the "scientific notation" we learned in high school, but they are stored and manipulated using binary representations.

For the purposes of this column, we will focus on the IEEE 754-2008 floating-point standard (hereinafter referred to as "754"). 754 defines a handful of binary formats, distinguished primarily by the width of their mantissa. These are *Half*, *Single*, *Double*, *Double Extended*, and *Quad Precision*. The binary representations of these are as follows (754 specifies only minimum widths for the extended format; the familiar x87 80-bit layout is shown):

| Format | Total bits | Sign | Exponent | Mantissa (stored) |
|---|---|---|---|---|
| Half | 16 | 1 | 5 | 10 |
| Single | 32 | 1 | 8 | 23 |
| Double | 64 | 1 | 11 | 52 |
| Double Extended (x87) | 80 | 1 | 15 | 64 (explicit leading bit) |
| Quad | 128 | 1 | 15 | 112 |

754 also includes some special encodings for certain values, such as NaN (not a number), infinity, and some others. I'll leave it to you to research those. For clarity, I'll stick to the half-precision (16-bit) format in this article. Except for the range of possible values and the biases (which I'll blather on about in a bit), things work the same for each type.

First, there's the sign bit. If our number is negative, then the sign bit will be a '1'; otherwise it will be a '0' (negative zero is possible, to keep divide-by-zeroes "honest"). Easy, right?

Next is the exponent. Here there is a trick: the exponent field has no sign bit of its own, and we (should) all know that exponents can be negative. The trick is that the exponent is stored with a *bias*, which must be subtracted in order to find its true value. The bias can be computed as follows:

`b = 2^(n-1) - 1`

Or, equivalently:

`b = (2^n - 2) / 2`

Where:

- b = the bias
- n = the number of bits in the exponent

More simply, the biases are shown in the table below:

| Format | Exponent bits (n) | Bias (b) |
|---|---|---|
| Half | 5 | 15 |
| Single | 8 | 127 |
| Double | 11 | 1023 |
| Quad | 15 | 16383 |
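One way to double-check these biases is to compute them directly; here's a quick Python sketch of the formula above (my own illustration, not part of the standard):

```python
def exponent_bias(n: int) -> int:
    """Bias for an n-bit exponent field: 2^(n-1) - 1."""
    return (1 << (n - 1)) - 1

# Exponent widths for the half, single, double, and quad formats
for name, n in [("half", 5), ("single", 8), ("double", 11), ("quad", 15)]:
    print(f"{name}: bias = {exponent_bias(n)}")
```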

This means that in our half-precision number, our exponent can have the range [-15, +16] (strictly speaking, the all-zeroes and all-ones exponent patterns are reserved for special values, leaving [-14, +15] for normal numbers). That's a lot of zeroes! But this introduces one of the drawbacks of floating-point numbers: binimal point alignment. Imagine that we're adding two numbers with different exponents -- we first need to shift one of the numbers (or both) until their binimal points are aligned. We also need to keep track of the result of the addition and update the exponent if there was a carry out (overflow).
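The alignment step can be sketched in a few lines of Python. This is my own illustration (plain integers standing in for hardware registers), assuming both operands are positive and already normalized with their implied leading '1' attached:

```python
def add_aligned(m1: int, e1: int, m2: int, e2: int, mant_bits: int = 11):
    """Add two positive values given as (mantissa, exponent) pairs.

    Mantissas are mant_bits wide and include the implied leading '1'.
    The smaller-exponent operand is shifted right until the binimal
    points line up; any bits shifted off the end are lost (truncation).
    """
    if e2 > e1:  # make (m1, e1) the larger-exponent operand
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)          # align binimal points
    result, exp = m1 + m2, e1
    if result >> mant_bits:   # carry out: renormalize and bump exponent
        result >>= 1
        exp += 1
    return result, exp

# 1.5 (0b11000000000, exp 0) + 1.0 (0b10000000000, exp 0) = 1.25 x 2^1 = 2.5
print(add_aligned(0b11000000000, 0, 0b10000000000, 0))
```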

The final part of our floating point number is the mantissa. In our half-precision implementation there are 11 bits of information. "Wait," you might say, "there are actually only 10 bits!" This is true, but the trick is that the 11th bit (the most-significant bit) is implied. Basically, you keep shifting the mantissa (and adjusting your exponent accordingly) until the most-significant '1' sits immediately to the left of the binimal point, at which point you "throw that '1' away" and store only the bits to its right. Here's an example -- let's store the number 0.02 as follows:

- 0.02 = 1.28 × 2^-6
- sign = 0 (positive)
- exponent = -6 + 15 (bias) = 9 = `01001`
- mantissa = 1.28 ≈ 1.0100011110 in binary -- drop the implied leading '1' to get `0100011110`

So 0.02 is stored (with truncation) as `0 01001 0100011110`.
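We can sanity-check an encoding like this with Python, whose `struct` module supports the 754 half-precision format through the `'e'` format code (a quick sketch; note that `struct` rounds to nearest rather than truncating, so the lowest fraction bit may differ by one from a hand-truncated value):

```python
import struct

def half_bits(x: float) -> str:
    """Return the 16 bits of x encoded as IEEE 754 half precision,
    grouped as sign | exponent | fraction."""
    (raw,) = struct.unpack('<H', struct.pack('<e', x))
    bits = f"{raw:016b}"
    return f"{bits[0]} {bits[1:6]} {bits[6:]}"

print(half_bits(0.02))   # sign 0, exponent 01001 (9 - 15 = -6), rounded fraction
```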

So, that's how we store our number into the floating-point format. Even though the floating-point format has the advantages of high dynamic range in a fairly compact space, we can also see that there are some disadvantages as follows:

- Floating-point (like all binary representations) does not map well into decimal (that is, the precision limits the translation).
- You may need to apply the bias prior to an operation (multiplication and division are simpler, as the biased exponents just add/subtract, with a single bias adjustment on the result).
- To add two numbers, you must first "unroll" the exponents to align their binimal points.
- After a math operation, you must "reroll" the exponents.
- The act of "unrolling" the exponents can lead to a loss of precision if your registers are too narrow.

There is also a disadvantage in that you may need to "intelligently" decide which number to unroll for a given operation -- that is, you must decide which value is *more significant* so you don't lose (or gain) significance during the unrolling operations. As an example, consider 3.24 + 0.02001; which of these should lose bits if it proves necessary to do so? The answer is 0.02001, as the result cannot be more "precise" than any of the inputs.

Another drawback that should be obvious at this point is *truncation*. 754 defines several different *rounding* modes, but I'm betting that in many cases we might not want to waste the extra hardware to implement them. Unfortunately, truncation (unlike rounding) always "pushes" our results towards zero (smaller magnitude). This may not seem like a big deal for a single operation, but truncation errors tend to "stack up" over multiple operations and can badly skew results. This is a big enough deal that IBM used to add guard bits to its machinery back in the bad old days.
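To see how truncation errors stack up, here is a small Python experiment (my own sketch, using a hypothetical 10-bit fraction like our half-precision mantissa and truncating after every addition):

```python
def truncate_fraction(x: float, bits: int = 10) -> float:
    """Truncate x toward zero, keeping only `bits` fractional bits."""
    scale = 1 << bits
    return int(x * scale) / scale   # int() truncates toward zero

# Accumulate 0.3 one thousand times, truncating after each add
acc = 0.0
for _ in range(1000):
    acc = truncate_fraction(acc + 0.3)

print(acc)          # falls short of the true sum of 300.0
print(300.0 - acc)  # the accumulated truncation error (~0.2)
```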

On the plus side, the floating-point format provides a relatively compact way to store data, and you can always "roll your own" format for customized needs. Having said this, when talking to other computing machinery, the IEEE standard is widely used, so sticking with it might make life easier for a lot of other people.

Of course, there are libraries available that claim to do all the hard stuff for you, but since I don't have any experience with those, I won't comment on the veracity of their claims.

Once again, as I stated in the last couple articles, it's *your* job to understand *your* needs and to choose the method that is best for *your* application. Next time, we'll take a look at fixed-point representations; in the meantime, please post any questions or comments below.


anon5532556 1/13/2014 3:11:34 PM

The 9511 was a 32 bit floating point chip with 8 bit bus. Easy to hook to a Z80 etc. It directly handled a bunch of curve type functions and the like. Mostly used it for sin/cos/tan things doing earth curvature work. The 9512 was much simpler but wider inside. Both ran hot, and cost a lot.

Oddly enough it looks like MicroMega currently sells an FPU for microcontroller projects. That has to be going away though, as the ARM 32F4 part I'm using today does floating point so fast I regularly use it in interrupt routines.


TanjB 1/11/2014 2:39:00 AM

16 bit FP is making a comeback. You can find it supported in some current GPUs. I believe it is used mostly to represent high dynamic range graphical data but there are probably other uses.

Of course, 8 bit FP was actually hugely important. The A-law and mu-law codecs used by all phone networks in the ISDN days, still used in some landlines and voice exchanges, were essentially FP with a sign, 3 bit exponent, and 4 bit fraction (with implied leftmost 1, just like IEEE formats).


TanjB 1/11/2014 2:32:48 AM

One of my colleagues wrote an infinite (unlimited rationals) precision arithmetic package and we used that to get some insights and to check what the true optimal solutions were for some test cases. It was educational but too slow for real world use.

The field has changed enormously since JvN's time. Heck I think he died in that car crash before Simplex even became widespread. Numerical optimization theory blossomed in the 1980s with real insights into non-linear, and then the implementations accelerated enormously in the 1990s and 2000s. Only the square root of the improvement due to hardware, the rest due to clever algorithms. I'm sure that John would love the kinds of optimization which we do today for monster problems like deep neural networks but it is a hugely different field than what he helped start.


Max The Magnificent 1/8/2014 3:39:43 PM

> once you've done your numerical analysis you've already completed most of the work needed to represent your problem using fixed-point arithmetic.

LOL! I think the main thing is to understand what one is trying to do and take the expected data and application into account. As you note, if you perform Y + X where Y is a very big value and X is a very small one, you will end up with just Y .... but if X and Y are both in the same ball-park size-wise, then the problem is much reduced.


anon5532556 1/8/2014 3:22:44 PM

Nice write up btw :)

PS: Anybody else use the AMD 9511 or 9512?


betajet 1/8/2014 2:07:22 PM

Floating-point numbers are not real numbers, so the normal laws of real numbers -- like associativity of addition -- do not apply. When you add a tiny floating-point number X to a big floating-point number Y, all the bits of X fall into the bit bucket and you end up with Y, not X+Y. Sometimes you need to use algebraic tricks to re-write your formulas into expressions that are stable for your problem and hope the compiler doesn't "optimize" them.

I've read that John von Neumann greatly disliked floating-point because (1) he'd rather use those exponent bits for more precision, and (2) once you've done your numerical analysis you've already completed most of the work needed to represent your problem using fixed-point arithmetic.
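For instance, in double precision, both the absorption and the failure of associativity are easy to demonstrate (a minimal sketch):

```python
x = 2.0 ** 53              # the first double whose neighbors are 2 apart
left  = (x + 1.0) + 1.0    # each 1.0 is rounded away separately
right = x + (1.0 + 1.0)    # the 1.0s combine into 2.0, which survives
print(left == right)       # False: left == x, right == x + 2
```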


Max The Magnificent 1/8/2014 12:01:49 PM

> as for the need for precision in a world where resistors might be accurate only to a percent, it is amazing how easy it is to get yourself into trouble with the math once you start doing simulations and...

VERY good point!!!


TanjB 1/8/2014 11:51:31 AM

FP calculations (in any radix) are common in engineering, science, and anything approximate. Even in finance they are perfectly fine to use in situations like estimating future or present value, or allocating budgets.

When it comes to accounting for the cents, however, fixed point is more likely what you want. Most of those operations are multiplies, adds and subtracts, which are exact in fixed point, with the occasional fraction like taxes which have rounding rules built in.

And as for the need for precision in a world where resistors might be accurate only to a percent, it is amazing how easy it is to get yourself into trouble with the math once you start doing simulations and (much, much trickier) optimizations. Simple components like transformers are nearly singularities. Numerical optimization packages are black arts mostly because of the clever tweaks needed to efficiently detect and work around problems with the limited (!) precision of 64 bit doubles.


Max The Magnificent 1/8/2014 9:58:37 AM

> Sic transit gloria mundi

"Obesa cantavit" (The fat lady has sung :-)


Max The Magnificent 1/8/2014 9:52:35 AM

> The IEEE 754 standard (2008) has introduced decimal floating point.

Didn't it actually introduce multi-radix floating-point, of which decimal is one incarnation, or was decimal singled out?