Fractional fixed point (often referred to as "Q" format) is efficient -- you choose exactly the precision you need, no less and no more -- and the bookkeeping exercise of keeping track of the radix point is not a big deal.

This is actually what I used in my final implementation - We'll get to that in a few weeks, I suppose.... (Max willing)

These are clever tricks, but why are you stuck on dealing with base 10, when ultimately you're implementing all the operations with shifts, adds & subtracts in base 2?

I didn't intend to imply that I was stuck with base 10. Base 10 is just natural for us humans, and there are plenty of applications that use it a lot, especially when the end result is decimal math. In this specific application, I am sending the computation engine decimal numbers, and expecting them in return. It was worth my while to run down the rabbit hole to see if there was an immediately easy way to get it done this way.

Not only do floating point numbers not obey associativity, they also lack precision. Sure, if 24 bits of precision doesn't meet your needs, you can go to double precision, but both formats are wasteful when your doing arithmetic in hardware.

Fractional fixed point (often referred to as "Q" format) is efficient -- you choose exactly the precision you need, no less and no more -- and the bookkeeping exercise of keeping track of the radix point is not a big deal.

Floating-point numbers are not real numbers. Real numbers obey the associative law of addition. Floating-point numbers do not. Try adding 1 to an accumulator 10^9 times with six-digit floating point. Once the accumulator has reached 10^6, adding more ones doesn't change the accumulator, so the sum of 10^9 ones is 10^6 instead of 10^9. If you add the 1's in groups of 10, and then add those sums in groups of 10, and so on, you'll get the correct value. However, since the result depends on the order of addition, the floating-point numbers violate the associative law. Don't expect floating-point to behave like real numbers without considering these effects.

Non-negative integers, OTOH, do behave mathemically like modulo 2^n numbers so you do get the correct result modulo 2^n.

I agree with the above poster regarding using a decimal radix. Why not use base 2 like IEEE floating point or base 16 like IBM/360?

The way I look at this is as a modified form of Q15 arithmetic. For starters, I can represent a fractional number as a 16 bit fixed point signed integer by the following relationship:

-32768 <--> -.5 32768 <--> .5

Thus .5 is 0x 7FFF (almost). To get .1, I divide by 5 and .1 is 0x199A. This is where the 0x199A factor comes from. When I multiply 2 Q15 numbers, I get a 32 bit result a 32-bit Q31 result. This means .25*.1 is as follows:

.25 => 0x4000 .1 => 0x199A

0x4000 * 0x199A => 0x06668000 0x666800 >> 16 is 0x0666 => which corresponds to .025

These are clever tricks, but why are you stuck on dealing with base 10, when ultimately you're implementing all the operations with shifts, adds & subtracts in base 2?

To save this item to your list of favorite EE Times content so you can find it later in your Profile page, click the "Save It" button next to the item.

If you found this interesting or useful, please use the links to the services below to share it with other readers. You will need a free account with each service to share an item via that service.