If you do financial calculations for instance, then better design for worst case, but when it comes to FIR filter for signal processing, designing for the absolute worst case with respect to maximum amplitude at the input is not always necessary. To illustrate: Take a narrow band FIR filter (f0). The maximum output ampltude will be with a sinewave at the centre frequency f0. But if you know that such an input signal will never be present in practice, then you can relax on the maximum supported ampltude and discard some MSB as illustrated in my small numeric example with modulus arithmetic. A realistic example of such an input could be if the sinewave at f0 is always accompanied by other signals - noise for example, then you know that the input sinewave can never fill the whole input amplitude, because if it did, there would be clipping or overrun already at the input. The out-of-band (f0) signals would be filtered away in the FIR filter and give little contribution to the output - only the sinewave at f0 gets through. I.e. the net result is that you can spare some of the MSB's without damaging your signal. But again, using saturation type of adders in the intermediate results would result in distortion at a lower input level than without saturation.
To determine if the filter is properly designed you would need to simulate or at least be prepared to do some iterations of test and redesign.

Whether writing code or designing hardware one should always ensure it works in the worse case. With code you need to ensure you hit all of the corner cases. You can model it in excel, matlab or mathcad etc to ensure the worse case is captured.

Thanks for a good article. Your warning, however, about overflow producing an incorrect result is not concern in all cases and has important practical implications. I think it is less known and quite astonishing: It goes..
"You can add any quantity of fixed point signed numbers (say W bits wide), in _any_ order and ignore overflow - PROVIDED that the final result is within the range of the accumulator. The result will always be correct!"
Eksample: W=3 bits and for simplicity of example - no fractional bits, so numbers can be [-4,-3,-2,-1-0,1,2,3].
We all agree on the following example calculation using decimal numbers:
-4-3+2+3=-2
No do the summing from left to right using only 3 bits in the accumulator (Bxxx is in binary:
-4-3=-7 (B100+B101=B1001 overflow! - remove the excess bits) Result=B001=1,
1+2=3 (B001+B010=B011)
3+3=6 (B011+B011=B110) but B110 is the same as -2
I.e. the result we were looking for.
All this is due to the modulus arithmetic in operation.
Be cautious though. This does NOT work if you - in a mistaken attempt to be cautious and careful to catch errors - implement the adder with saturation! The result will be totally wrong. So in an FIR (Finite Impulse Response filter) for instance where such a long sum is bread-and-butter, one should _not_ use a saturating adder but simply truncate the overflowing bits.
By the way in this example you could do with just 2 bits in the accumulator (and each number for that matter) because the result -2 can be represented by a 2 bits. Only the result determines the size required by the accumulator excess bits can be ignored.
Henning E. Larsen

George
Thanks for the kind comments.
With respect to point one I did cover storing different scaling factors in a vector as opposed to the actual width. The key becomes can you accurately represent the number in the vector width available i.e. dynamic range as you correctly point out.
I made the point about aligning the numbers for division as while it is possible to divide none aligned numbers. the scaling of the result will be the difference between the two and you have to be careful not to send them negative. As this is a basic how to article I did not want to introduce to many concepts. I will address this in my blog over at programmable planet however as it is an important concept.
Thanks again for taking the time to read it I do appreciate it ;)

It is good to see these issues being discussed in this forum - thank you.
A couple of issues for clarity:
1. You don't need more bits to represent a larger number in fixed point, as there is no reason to require the unit digit to be part of the bit representation, e.g. we could have "01" representing 1x2^8 if we like. Equally it could represent 1x2^{-8}. So long as we know the scaling, it doesn't matter - in effect - whether the binary point is inside or outside the number. Thus the number of bits defines the dynamic range, but not the range of representable numbers. This is not quite captured by the notion of "integer bits" used here.
2. To add and subtract, you must align scalings of fixed point arguments, as you say. But you don't need to do this for division.
Readers may be interested in the latest developments on this and related issues in IEEE Design & Test magazine: "Numerical Data Representations for FPGA-based Scientific Computing", G.A. Constantinides, N. Nicolici, A.B. Kinsman, IEEE Design and Test 28(4).

No problem, it is hard to determinine what to include in these articles. Regarding the Log10 they will of course work in any base but most calculators have base log10 and ln hence my use of the log10. but any base will do as you say. Thanks for your comments

As we unveil EE Times’ 2015 Silicon 60 list, journalist & Silicon 60 researcher Peter Clarke hosts a conversation on startups in the electronics industry. Panelists Dan Armbrust (investment firm Silicon Catalyst), Andrew Kau (venture capital firm Walden International), and Stan Boland (successful serial entrepreneur, former CEO of Neul, Icera) join in the live debate.

To save this item to your list of favorite EE Times content so you can find it later in your Profile page, click the "Save It" button next to the item.

If you found this interesting or useful, please use the links to the services below to share it with other readers. You will need a free account with each service to share an item via that service.