
It would slightly simplify your equations for determining the required number of bits to simply use 'log' for the quotient of two logs, since the base does not matter. (I also thought the base was presented as a subscript rather than a superscript, but that may be a cultural difference.)
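As a rough illustration of the base-independence point, the Python sketch below evaluates a quotient of two logs in three different bases. The function name `log_ratio` and the sample value 1000 are illustrative assumptions, not taken from the article.

```python
import math

def log_ratio(value, log_fn):
    """Return log(value) / log(2) using the supplied logarithm function."""
    return log_fn(value) / log_fn(2)

# The same quotient computed with base-10, natural, and base-2 logs.
ratios = [log_ratio(1000, fn) for fn in (math.log10, math.log, math.log2)]

# Rounding up gives a whole number of bits.
bits = [math.ceil(r) for r in ratios]
print(ratios, bits)
```

All three quotients agree to within floating-point rounding, so the rounded-up bit count is the same whichever base is used.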
It seems that a high result multiplier would be more desirable than a full precision (doubled precision result) or low result multiplier. I.e., one tends to care about the most significant bits. (Since a normalized FP multiply uses the high result, I would guess that FPGAs support such.)
Also, with multiplication (and division), shifts can be done before the operation or after the operation as long as the multiplier will not lose necessary precision. (When one operand is a constant, this can allow one to avoid shifting.)
Sorry, some of that was already mentioned; I should have read more carefully!
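A minimal sketch of the two ideas above, assuming unsigned 8-bit operands. The function names and the operand values are hypothetical and chosen only for illustration.

```python
W = 8  # operand width in bits (illustrative assumption)

def mul_high(a, b, width=W):
    """Multiply two unsigned width-bit integers and keep only the top width bits."""
    full = a * b            # full-precision product, up to 2*width bits
    return full >> width    # shift after the multiply: discard the low half

def mul_preshift(a, b, width=W):
    """Shift each operand right by width/2 first, so the product fits in width bits."""
    half = width // 2
    return (a >> half) * (b >> half)  # shift before the multiply

a, b = 200, 100
print(mul_high(a, b), mul_preshift(a, b))
```

Both results carry the same scaling (the true product divided by 2^8, which is 78.125 here), but shifting first throws away operand bits before they can contribute, so it is only safe when those bits are not needed.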
I am guessing that your polynomial constant A is 131.29 (as in the original formula and in the scaled version in the table) not 133.29 (as in the table and the scaling factor calculation).
(By the way, using a superscript for squaring would seem to be clearer than using x2. I am guessing that this was a conversion to html issue.)
It's 11:30pm in the UK as I pen these words; I'm sure Adam will respond in the morning. Max
re: The basics of FPGA mathematics
AdamTaylor
8/8/2012 6:50:56 AM
Sorry gents, you are correct: it should be 131.29. My apologies.
re: The basics of FPGA mathematics
AdamTaylor
8/8/2012 6:53:11 AM
No problem, it is hard to determine what to include in these articles. Regarding the log10, the equations will of course work in any base, but most calculators have log10 and ln, hence my use of log10. Any base will do, as you say. Thanks for your comments.
It is good to see these issues being discussed in this forum; thank you.
A couple of issues for clarity:
1. You don't need more bits to represent a larger number in fixed point, as there is no reason to require the unit digit to be part of the bit representation, e.g. we could have "01" representing 1x2^{8} if we like. Equally it could represent 1x2^{-8}. So long as we know the scaling, it doesn't matter, in effect, whether the binary point is inside or outside the number. Thus the number of bits defines the dynamic range, but not the range of representable numbers. This is not quite captured by the notion of "integer bits" used here.
2. To add and subtract, you must align scalings of fixed point arguments, as you say. But you don't need to do this for division.
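The two points above can be sketched in Python by carrying the scaling alongside the stored bits. The `Fixed` class and its field names are hypothetical, and multiplication is used here as the simplest operation that needs no alignment.

```python
from dataclasses import dataclass

@dataclass
class Fixed:
    raw: int  # the stored integer bit pattern
    exp: int  # scaling kept outside the word: value = raw * 2**exp

    def value(self):
        return self.raw * 2.0 ** self.exp

def add(x, y):
    """Align both operands to the smaller exponent, then add the raw integers."""
    e = min(x.exp, y.exp)
    return Fixed((x.raw << (x.exp - e)) + (y.raw << (y.exp - e)), e)

def mul(x, y):
    """No alignment needed: multiply the raw integers and add the exponents."""
    return Fixed(x.raw * y.raw, x.exp + y.exp)

big = Fixed(0b01, 8)     # "01" read as 1 x 2^8
small = Fixed(0b01, -8)  # the same bits read as 1 x 2^-8

print(big.value(), small.value(), add(big, small).value(), mul(big, small).value())
```

The same two-bit pattern stands for 256 or for 1/256 depending only on the recorded scaling. The sum needs a 17-bit raw value to hold both terms exactly, which is the dynamic-range limit in action.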
Readers may be interested in the latest developments on this and related issues in IEEE Design & Test magazine: "Numerical Data Representations for FPGA-based Scientific Computing", G.A. Constantinides, N. Nicolici, A.B. Kinsman, IEEE Design and Test 28(4).
re: The basics of FPGA mathematics
AdamTaylor
8/8/2012 5:33:23 PM
George
Thanks for the kind comments.
With respect to point one, I did cover storing different scaling factors in a vector as opposed to the actual width. The key question becomes whether you can accurately represent the number in the vector width available, i.e. the dynamic range, as you correctly point out.
I made the point about aligning the numbers for division because, while it is possible to divide non-aligned numbers, the scaling of the result will be the difference between the two, and you have to be careful not to send it negative. As this is a basic how-to article, I did not want to introduce too many concepts. I will address this in my blog over at Programmable Planet, however, as it is an important concept.
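One reading of the caution above, sketched in Python under the assumption that each operand is an integer with a known number of fractional bits. The function names and sample values are hypothetical.

```python
def fixed_div(a_raw, a_frac, b_raw, b_frac):
    """Divide two fixed-point integers; return (quotient_raw, quotient_frac_bits)."""
    # The quotient's scaling is the difference of the operand scalings.
    return a_raw // b_raw, a_frac - b_frac

def fixed_div_prescaled(a_raw, a_frac, b_raw, b_frac, out_frac):
    """Pre-shift the dividend so the quotient has out_frac fractional bits."""
    shift = out_frac - (a_frac - b_frac)
    if shift < 0:
        raise ValueError("out_frac too small for this sketch")
    return (a_raw << shift) // b_raw, out_frac

# 3.0 with 4 fractional bits divided by 1.5 with 8 fractional bits.
a_raw, a_frac = 48, 4    # 48 / 2**4 = 3.0
b_raw, b_frac = 384, 8   # 384 / 2**8 = 1.5

print(fixed_div(a_raw, a_frac, b_raw, b_frac))
print(fixed_div_prescaled(a_raw, a_frac, b_raw, b_frac, 4))
```

Dividing the unaligned values directly leaves the quotient with -4 fractional bits and a raw value of 0, so the answer is lost. Shifting the dividend left by 8 first gives a raw value of 32 with 4 fractional bits, i.e. 2.0.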
Thanks again for taking the time to read it I do appreciate it ;)
re: The basics of FPGA mathematics
larsen
8/9/2012 8:20:51 PM
Thanks for a good article. Your warning about overflow producing an incorrect result, however, is not a concern in all cases, and this has important practical implications. I think the following is less well known and quite astonishing. It goes:
"You can add any quantity of fixed point signed numbers (say W bits wide), in _any_ order and ignore overflow, PROVIDED that the final result is within the range of the accumulator. The result will always be correct!"
Example: W=3 bits and, for simplicity of example, no fractional bits, so numbers can be [-4,-3,-2,-1,0,1,2,3].
We all agree on the following example calculation using decimal numbers:
-4-3+2+3=-2
Now do the summing from left to right using only 3 bits in the accumulator (Bxxx is in binary):
-4-3=-7 (B100+B101=B1001 overflow! Remove the excess bits.) Result=B001=1,
1+2=3 (B001+B010=B011)
3+3=6 (B011+B011=B110) but B110 is the same as -2
I.e. the result we were looking for.
All this is due to the modulus arithmetic in operation.
Be cautious though. This does NOT work if you, in a mistaken attempt to be cautious and careful to catch errors, implement the adder with saturation! The result will be totally wrong. So in an FIR (Finite Impulse Response) filter, for instance, where such a long sum is bread-and-butter, one should _not_ use a saturating adder but simply truncate the overflowing bits.
By the way, in this example you could do with just 2 bits in the accumulator (and in each number, for that matter) because the result -2 can be represented by 2 bits. Only the result determines the size required by the accumulator; excess bits can be ignored.
Henning E. Larsen
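The worked example above can be checked with a short Python sketch that runs the same sum through a wrapping accumulator and a saturating one. The helper names `wrap`, `saturate`, and `accumulate` are hypothetical.

```python
def wrap(x, width):
    """Keep the low width bits of x and read them as a signed two's-complement value."""
    x &= (1 << width) - 1
    return x - (1 << width) if x & (1 << (width - 1)) else x

def saturate(x, width):
    """Clamp x to the signed range of a width-bit accumulator."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, x))

def accumulate(values, width, step):
    """Sum values left to right, applying step (wrap or saturate) after every add."""
    acc = 0
    for v in values:
        acc = step(acc + v, width)
    return acc

values = [-4, -3, 2, 3]  # true sum is -2

print(accumulate(values, 3, wrap))      # 3-bit wrapping accumulator
print(accumulate(values, 3, saturate))  # 3-bit saturating accumulator
print(accumulate(values, 2, wrap))      # 2-bit wrapping accumulator
```

The wrapping accumulator returns -2 at both 3 and 2 bits, even though intermediate sums overflow. The saturating accumulator clamps -7 to -4 and ends on 1, which is wrong.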
Excellent point about not using saturation arithmetic in FIR filters, and just allowing modulo arithmetic to do its thing.





