# Doing Math in FPGAs, Part 3 (Floating-Point)

Floating-point numbers are similar to the "scientific notation" we learned in high school, but they are stored and manipulated using binary representations.

It seems I have something of a mini-series of blogs going on here.

First, I muttered some inanities about multiplication and division by 10 (see Doing Math in FPGAs, Part 1). Next, I rambled on about doing math in BCD (see Doing Math in FPGAs, Part 2 (BCD)). Now, it seems it's time to mutter something about floating-point representations of numbers and how to do some math with them. I considered using floating-point representations for this mysterious project that I've been alluding to (I'll get to that, one of these days, maybe), so I took a quick look at how to implement them.

Now, of course, there are plenty of ways one could represent a floating-point number. You can do it your way, I can do it my way, or we can all agree to follow a standard such as the IEEE 754-2008 standard, for example. Of course, I'm not the first person here on EE Times to cover the topic of floating-point representations; in fact, Mr. Kjodavix described this way back in 2006 (see Tutorial: Floating-point arithmetic on FPGAs). Because of Mr. Kjodavix's article, I wondered whether I should even bother expounding on floating-point concepts. However, we all *speak* a little differently and we all *learn* a little differently, so maybe my take on this will make someone else's grasp a little better (I *do* recommend reading Mr. Kjodavix's article, though).

So what are floating-point numbers? Well, let's start with the fact that, due to the way in which we build our computers using two-state logic (let's not worry about experiments with ternary, or three-state, logic), we have to store numbers using some form of binary representation. It's relatively easy to use binary values to represent integers, but they don't lend themselves to directly storing *real numbers*; that is, numbers that include fractional values with digits after the decimal point. In other words, it's relatively easy to use binary to represent a value like 3, but it's less easy to represent a value like 3.141592. Similarly, it's relatively easy to create logic functions to implement mathematical operations on integer values, but it's less easy to work with real numbers.

Of course, we can store numbers in *BCD* (I talked about this in my previous blog), or we could use *fixed-point* representations (I will talk about this next time), but what do we actually mean by *floating-point*? Well, it's a lot like the "scientific notation" we learned in high school (e.g., 31.41592 × 10^-1), but it's stored and manipulated using binary representations.

So, how might we perform the mighty feat of representing a *real number* in binary? If we just assume a binimal point (the binimal point is to base 2 what the decimal point is to base 10) at some fixed position in the middle, then we'd have a fixed-point representation as illustrated below:
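To make the idea concrete, here's a quick Python sketch (not FPGA logic, but the bit interpretation is the same): we store the bits as a plain unsigned integer and simply *agree* on how many of them sit to the right of the binimal point. The function name and the 3-fractional-bit split are just for illustration.

```python
def fixed_point_value(bits: str, frac_bits: int) -> float:
    """Interpret a binary string as fixed-point, with an assumed
    binimal point `frac_bits` places from the right."""
    return int(bits, 2) / (2 ** frac_bits)

# "101011" with 3 fractional bits means 101.011 in binary:
# 4 + 1 + 0.25 + 0.125 = 5.375
print(fixed_point_value("101011", 3))  # 5.375
```

Note that the hardware doesn't store the binimal point anywhere; it's purely a convention shared by whoever writes and reads the bits.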

I won't yammer on about this right now (that's for next time); suffice it to say that we would need a lot of bits to represent either a really big number or a really small one. Floating-point solves this problem by breaking the number up into three pieces: the *sign*, the *mantissa* (a.k.a. *significand* or *coefficient*), and the *exponent* (a.k.a. *characteristic* or *scale*). This gives us a fairly large dynamic range. The generic form is as follows:

n = ±x × b^y

Where:

- n = the number being represented
- ± = the sign of the number
- x = the mantissa of the number
- b = the number system base (10 in decimal; 2 in binary)
- y = the exponent (power) of the number (which can itself be positive, or negative)
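
This is exactly how an IEEE 754 single-precision value is laid out: 1 sign bit, 8 exponent bits (stored with a bias of 127), and 23 mantissa bits (with an implicit leading 1). As a quick illustration in Python rather than HDL (the helper name is mine, and I'm using `struct` to get at the raw bits):

```python
import struct

def decompose_float32(n: float):
    """Split an IEEE 754 single-precision value into sign,
    unbiased exponent, and the raw 23-bit mantissa field."""
    bits = struct.unpack(">I", struct.pack(">f", n))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits, implicit leading 1
    return sign, exponent - 127, mantissa

s, e, m = decompose_float32(3.141592)
# 3.141592 is roughly 1.570796 x 2^1, so sign = 0 and exponent = 1
print(s, e)  # 0 1
```

Reassembling (1 + m/2^23) × 2^e × (-1)^s gets you back to the original value, to within the precision of the 23-bit mantissa.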

Easy, right? Well, maybe not -- there are some tricks involved, as well as a variety of benefits and drawbacks. So, how do we represent floating-point numbers in our device? Well, there are plenty of different ways to do this: there's your way, there's my way, and there's some other guy's way.

For example, the exponent is usually an integer. We could extend this by allowing the exponent to have a fractional representation if we really wanted. In general, though, I don't know why we'd want to do that, as the result would just be another fractional number that we could easily represent (unless the exponent and the mantissa were both negative, in which case we'd have a complex number, and there are easier ways to represent those).

Author

tom-ii 1/8/2014 8:43:43 AM

"Sorry to disappoint you, but 101.011 = 5.375, not 5.15."

D'oh! Would you believe I'm no good at math?

"Sic transit gloria mundi..."

And if this is how the world progresses, then we're in trouble... Oh, wait...

Author

tom-ii 1/8/2014 8:37:33 AM

I believe the main reason for S/M representation is that it simplifies multiplication and division. 2's complement multiplication is a pain and requires more logic (IIRC), and division is hard enough without dealing with signs. With S/M, you just do unsigned multiplication and division and then XOR the sign bits.

Shhh! Spoilers!
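
(For anyone following along, a quick Python sketch of the sign/magnitude trick described above -- the function name and calling convention are just for illustration:)

```python
def sm_multiply(sign_a: int, mag_a: int, sign_b: int, mag_b: int):
    """Sign/magnitude multiply: do an unsigned multiply on the
    magnitudes, then XOR the sign bits to get the result's sign."""
    return sign_a ^ sign_b, mag_a * mag_b

# (-3) * (+5): sign bits 1 and 0, magnitudes 3 and 5
print(sm_multiply(1, 3, 0, 5))  # (1, 15), i.e. -15
```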

Author

azassoko 1/8/2014 8:11:23 AM

Sorry to disappoint you, but 101.011 = 5.375, not 5.15.

Sic transit gloria mundi...

AZ


Author

Max The Magnificent 1/7/2014 4:52:30 PM

"If you try comparing floating-point numbers using fixed-point compare instructions you deserve what you get :-)"

LOL

Author

betajet 1/7/2014 4:47:25 PM

2's complement arithmetic is good if you're always adding signed numbers, but if your logic needs to both add and subtract, then you've got to complement one of the operands anyway. I believe floating-point hardware does S/M add/sub using one's complement arithmetic -- I remember studying the IBM 360/91 paper that shows this in detail. The fun part of one's complement arithmetic is the end-around carry. To do this fast, you need carry look-ahead end-around carry logic, which turns out to be highly regular and beautiful.

Regarding +0 and -0: I should think the floating-point compare instructions take care of this. If you try comparing floating-point numbers using fixed-point compare instructions you deserve what you get :-)
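
(A quick Python sketch of the end-around carry for anyone who hasn't met one's complement before -- the helper names and the 8-bit width are just for illustration; real hardware would do the wrap-around with carry look-ahead rather than a conditional add:)

```python
def ones_complement_add(a: int, b: int, width: int = 8) -> int:
    """Add two one's-complement values with end-around carry."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask)
    if total > mask:
        # Carry out of the top bit wraps back into bit 0.
        total = (total & mask) + 1
    return total & mask

def to_ones_complement(n: int, width: int = 8) -> int:
    """Encode a signed integer in one's complement (negate = invert)."""
    mask = (1 << width) - 1
    return n & mask if n >= 0 else (~(-n)) & mask

# 7 + (-3): 00000111 + 11111100 = 1_00000011, wrap the carry -> 00000100
print(ones_complement_add(to_ones_complement(7), to_ones_complement(-3)))  # 4
```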

Author

Max The Magnificent 1/7/2014 4:41:04 PM

Let's start with the fact that whatever format we decide to use to represent numbers inside a computer, they all end up being stored as a sequence of 1s and 0s.

The thing is that, given a field of a fixed size, what sort of numbers can you store in it? Let's take the Arduino, because that's what I'm playing with at the moment. Consider a variable of type "long" -- this consumes 32 bits (4 bytes) and can be used to store integers in the range -2,147,483,648 to 2,147,483,647. These are big numbers and they are stored with absolute precision (so long as we only want integers), but what if we want to represent some value outside this range?

This is where floating-point comes in. A floating-point value on an Arduino also consumes 32 bits (4 bytes), but using the format you discuss (sign, mantissa, exponent), it can be used to represent values ranging from -3.4028235E+38 to 3.4028235E+38. This gives a humongous dynamic range, but at the cost of precision (these values have only 6-7 decimal digits of precision).
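
(You can see that precision trade-off directly -- here's a Python sketch that rounds a value to 32-bit float the way an Arduino would store it; the helper name is mine:)

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python double to the nearest 32-bit float,
    as a 4-byte Arduino `float` would store it."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = 2_147_483_647            # fits exactly in a 32-bit long...
print(to_float32(big))         # 2147483648.0 -- a 32-bit float is off by 1

print(to_float32(3.141592653589793))  # only ~7 digits survive
```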


Author

Max The Magnificent 1/7/2014 4:30:56 PM

One alternative, as you say, is to "roll your own". I've often toyed with the idea of creating my own 16-bit (Half Precision) library (providing reasonable dynamic range with limited precision, which would be applicable to certain applications) with stripped-down functionality so as to reduce the memory footprint ... but there never seems to be enough time ... I wonder if any of the other readers have done this?