It would slightly simplify your equations for determining the required number of bits to simply use 'log'--for the quotient of two logs, the base does not matter. (I also thought the base was presented as a subscript rather than a superscript, but that may be a cultural difference.)
It seems that a high result multiplier would be more desirable than a full precision (doubled precision result) or low result multiplier. I.e., one tends to care about the most significant bits. (Since a normalized FP multiply uses the high result, I would guess that FPGAs support such.)
Also with multiplication (and division) shifts can be done before the operation or after the operation as long as the multiplier will not lose necessary precision. (When one operand is a constant, this can allow one to avoid shifting.)

I am guessing that your polynomial constant A is 131.29 (as in the original formula and in the scaled version in the table) not 133.29 (as in the table and the scaling factor calculation).
(By the way, using a superscript for squaring would seem to be clearer than using x2. I am guessing that this was a conversion to html issue.)

No problem, it is hard to determinine what to include in these articles. Regarding the Log10 they will of course work in any base but most calculators have base log10 and ln hence my use of the log10. but any base will do as you say. Thanks for your comments

It is good to see these issues being discussed in this forum - thank you.
A couple of issues for clarity:
1. You don't need more bits to represent a larger number in fixed point, as there is no reason to require the unit digit to be part of the bit representation, e.g. we could have "01" representing 1x2^8 if we like. Equally it could represent 1x2^{-8}. So long as we know the scaling, it doesn't matter - in effect - whether the binary point is inside or outside the number. Thus the number of bits defines the dynamic range, but not the range of representable numbers. This is not quite captured by the notion of "integer bits" used here.
2. To add and subtract, you must align scalings of fixed point arguments, as you say. But you don't need to do this for division.
Readers may be interested in the latest developments on this and related issues in IEEE Design & Test magazine: "Numerical Data Representations for FPGA-based Scientific Computing", G.A. Constantinides, N. Nicolici, A.B. Kinsman, IEEE Design and Test 28(4).

George
Thanks for the kind comments.
With respect to point one I did cover storing different scaling factors in a vector as opposed to the actual width. The key becomes can you accurately represent the number in the vector width available i.e. dynamic range as you correctly point out.
I made the point about aligning the numbers for division as while it is possible to divide none aligned numbers. the scaling of the result will be the difference between the two and you have to be careful not to send them negative. As this is a basic how to article I did not want to introduce to many concepts. I will address this in my blog over at programmable planet however as it is an important concept.
Thanks again for taking the time to read it I do appreciate it ;)

Thanks for a good article. Your warning, however, about overflow producing an incorrect result is not concern in all cases and has important practical implications. I think it is less known and quite astonishing: It goes..
"You can add any quantity of fixed point signed numbers (say W bits wide), in _any_ order and ignore overflow - PROVIDED that the final result is within the range of the accumulator. The result will always be correct!"
Eksample: W=3 bits and for simplicity of example - no fractional bits, so numbers can be [-4,-3,-2,-1-0,1,2,3].
We all agree on the following example calculation using decimal numbers:
-4-3+2+3=-2
No do the summing from left to right using only 3 bits in the accumulator (Bxxx is in binary:
-4-3=-7 (B100+B101=B1001 overflow! - remove the excess bits) Result=B001=1,
1+2=3 (B001+B010=B011)
3+3=6 (B011+B011=B110) but B110 is the same as -2
I.e. the result we were looking for.
All this is due to the modulus arithmetic in operation.
Be cautious though. This does NOT work if you - in a mistaken attempt to be cautious and careful to catch errors - implement the adder with saturation! The result will be totally wrong. So in an FIR (Finite Impulse Response filter) for instance where such a long sum is bread-and-butter, one should _not_ use a saturating adder but simply truncate the overflowing bits.
By the way in this example you could do with just 2 bits in the accumulator (and each number for that matter) because the result -2 can be represented by a 2 bits. Only the result determines the size required by the accumulator excess bits can be ignored.
Henning E. Larsen

To save this item to your list of favorite EE Times content so you can find it later in your Profile page, click the "Save It" button next to the item.

If you found this interesting or useful, please use the links to the services below to share it with other readers. You will need a free account with each service to share an item via that service.