# Doing Math in FPGAs, Part 4 (Fixed-Point)

In my previous blog, I rambled on about floating-point representations. Today's column will be the last of these rantings about things you probably already know (at least, related to the representation of real numbers). In my silly little project, I looked into BCD, floating-point, and fixed-point. What I eventually settled on was fixed-point.

Like the other implementations we've discussed, there are plenty of ways to roll your own, but I settled on what I believe is generally called the Q,N Format. Or maybe I did roll my own. Regardless, I'll talk about the version I am using in my little project, and we'll go with that. It seems to me that there are as many ways to implement this as there are people trying to implement it, so be forewarned. This is *my* implementation (and I'm a certified professional idiot).

I will start by giving some credit to Sam Skalicky and his OpenCores.Org project, though I have modified his material extensively. I will also point you to this paper. In addition to glossing over some things, I'm also a know-nothing bozo (in case you haven't already figured that out).

I think that, right this second, we'd all be better served if I rehashed some of the things I said in my previous blog. This time, I'll try to relate things to fixed-point math and how I am applying it here. Sorry, this may go long, so skip down about three paragraphs if you already know about fixed-point and your eyes start to glaze over.

Let's start by looking at how we represent numbers in our base-10 system, which originated because most of us have 10 fingers.

This shouldn't be anything new to you. We all learned this in elementary school. It's elementary, Watson. (Sorry, I couldn't resist.) If I think of this as a computer register with 10 possible states (0-9), then the above could represent a register. If the maximum number of digits we have on either side of the decimal point is three, then the biggest number I can represent is ±999.999, assuming I keep that far left bucket as a sign bit.

The same is true of a base-2 binary system, only the powers are of two, rather than 10.

This is just like with the base-10 system, only now we're dealing with powers of 2. So, this is how we translate a binary number into our decimal system. If we have the same limits imposed as before -- three digits on either side of the binimal point -- then we can represent the range of ±7.65, provided I did my math correctly. And this is the basis for how we perform fixed-point math, except for the fact that we get to choose where the binimal point is; never mind about two's complement format, etc.

Practically speaking, we want to keep the maximum number of fractional bits (to get better resolution) without sacrificing errors on our integer bits. We want as much dynamic range as we need, but no more. Fractional errors add up over time, but integer errors skew the results immediately.

First, let's talk about addition. Adding fixed-point binary numbers is easy. We just perform a normal add (with a *caveat*) as if it were all an integer number. If one of the numbers is negative, we take the two's complement (invert and add 1) to get the negative representation. Do you remember ten's complement from the last installment? Yep, any base has its own complement series, and they all work basically the same way.

Our *caveat* is that the number of bits required to add two numbers, the biggest of which has N bits, is N+1. This means that, in order to add a four-bit number to another number of four or fewer bits, I will need a five-bit field in which to store the result. Here's an example.

This is not a formal proof, mind you, but it does demonstrate that we may need to be careful later. As long as we're sure we don't overflow, this is awesome. Our binimal point is in the same location as when we started. The overflow is basically into our sign bit.