
# The basics of FPGA mathematics

8/7/2012 1:35 PM EDT

Editor's Note: This article first appeared in Xcell Journal Issue 80 Third Quarter 2012, and is reproduced here with the kind permission of Xcell's publisher, Mike Santarini.

One of the many benefits of an FPGA-based solution is the ability to implement a mathematical algorithm in the best possible manner for the problem at hand. For example, if response time is critical, then we can pipeline the stages of mathematics. But if accuracy of the result is more important, we can use more bits to ensure we achieve the desired precision. Of course, many modern FPGAs also provide the benefit of embedded multipliers and DSP slices, which can be used to obtain the optimal implementation in the target device.

Let's take a look at the rules and techniques that you can use to develop mathematical functions within an FPGA or other programmable device.

Representation of numbers
There are two methods of representing numbers within a design, fixed- or floating-point number systems. Fixed-point representation maintains the decimal point within a fixed position, allowing for straightforward arithmetic operations. The major drawback of the fixed-point system is that to represent larger numbers or to achieve a more accurate result with fractional numbers, you will need to use a larger number of bits. A fixed-point number consists of two parts, integer and fractional.

Floating-point representation allows the decimal point to "float" to different places within the number, depending upon the magnitude. Floating-point numbers, too, are divided into two parts, the exponent and the mantissa. This scheme is very similar to scientific notation, which represents a number as A × 10^B, where A is the mantissa and B is the exponent. However, the exponent of a floating-point number is base 2, that is, A × 2^B. The floating-point format is standardized by IEEE/ANSI standard 754-1985. The basic IEEE single-precision floating-point number comprises a sign bit, an 8-bit exponent and a 24-bit mantissa (23 bits are stored; the leading 1 is implied).
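To make the field split concrete, a single-precision value can be pulled apart into its sign, exponent and mantissa. The sketch below uses Python purely as a scratch-pad (the article's own code is VHDL); the bias of 127 is part of the IEEE 754 format.

```python
import struct

def decompose(f):
    # Reinterpret the 32-bit IEEE 754 single-precision pattern as an integer
    bits = struct.unpack('>I', struct.pack('>f', f))[0]
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent (bias = 127)
    fraction = bits & 0x7FFFFF       # 23 stored fraction bits; the leading 1 is implied
    return sign, exponent, fraction

print(decompose(1.0))    # (0, 127, 0): +1.0 x 2^(127-127)
print(decompose(-2.0))   # (1, 128, 0): -1.0 x 2^(128-127)
```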

Due to the complexity of floating-point numbers, we as designers tend wherever possible to use fixed-point representations. An 8,8 fixed-point number (8 integer bits, 8 fractional bits), for example, is capable of representing an unsigned number between 0.0 and 255.99609375, or a signed number between -128.0 and +127.99609375 using two's complement representation. Within a design you have the choice – typically constrained by the algorithm you are implementing – to use either unsigned or signed numbers. Unsigned numbers are capable of representing a range of 0 to 2^n - 1, and always represent positive numbers. By contrast, the range of a signed number depends upon the encoding scheme used: sign and magnitude, one's complement, or two's complement.

The sign-and-magnitude scheme utilizes the left-most bit to represent the sign of the number (0 = positive, 1 = negative). The remainder of the bits represent the magnitude. Therefore, in this system, positive and negative numbers have the same magnitude but the sign bit differs. As a result, it is possible to have both a positive and a negative zero within the sign-and-magnitude system.

One's complement uses the same unsigned representation for positive numbers as sign and magnitude. However, for negative numbers it uses the inversion (one's complement) of the positive number.

Two's complement is the most widely used encoding scheme for representing signed numbers. Here, as in the other two schemes, positive numbers are represented in the same manner as unsigned numbers, while negative numbers are represented as the binary number you add to a positive number of the same magnitude to get zero. You calculate a negative two's complement number by first taking the one's complement (inversion) of the positive number and then adding one to it. The two's complement number system allows you to subtract one number from another by performing an addition of the two numbers. The range a two's complement number can represent is given by:

-(2^(n-1)) to +(2^(n-1) - 1)

One way to convert a number to its two's complement format is to work right to left, leaving the number the same until you encounter the first "1." After this point, each bit is inverted.
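These rules are easy to verify with a short script. The sketch below (Python as a scratch-pad, with an arbitrary 8-bit width) checks the invert-and-add-one rule, subtraction by addition, and the right-to-left conversion trick:

```python
def twos_complement(value, bits):
    # Represent a signed integer in two's complement within `bits` bits
    return value & ((1 << bits) - 1)

# Negating +5 in 8 bits: invert, then add one
pos = 0b00000101
neg = (~pos + 1) & 0xFF
assert neg == twos_complement(-5, 8) == 0b11111011

# Subtraction by addition: 12 - 5 performed as 12 + (-5), modulo 2^8
assert (12 + neg) & 0xFF == 7

# Right-to-left trick: keep bits up to and including the first '1', invert the rest
# 0b00010100 (20) -> keep '100', invert '00010' -> 0b11101100 (-20)
assert twos_complement(-20, 8) == 0b11101100
print("all checks pass")
```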

Fixed-point mathematics
The normal way of representing the split between integer and fractional bits within a fixed-point number is x,y where x represents the number of integer bits and y the number of fractional bits. For example, 8,8 represents 8 integer bits and 8 fractional bits, while 16,0 represents 16 integer and 0 fractional. In many cases, you will determine the correct number of integer and fractional bits required at design time, normally following conversion from a floating-point algorithm. Thanks to the flexibility of FPGAs, we can represent a fixed-point number of any bit length; the number of integer bits required depends upon the maximum integer value the number is required to store, while the number of fractional bits will depend upon the accuracy of the final result. To determine the number of integer bits required, use the following equation:

Integer bits = ceil(log2(maximum value))

For example, the number of integer bits required to represent a value between 0.0 and 423.0 is given by:

ceil(log2(423)) = ceil(8.724) = 9

That means you would need 9 integer bits, allowing a range of 0 to 511 to be represented. Representing the number using 16 bits would allow for 7 fractional bits. The accuracy this representation would be capable of providing is given by:

2^-7 = 0.0078125
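This sizing calculation is easy to script when exploring formats; a minimal sketch (Python as a modeling aid, valid for maximum values below a power of two):

```python
import math

def integer_bits(max_value):
    # Integer bits needed to represent values up to max_value
    return math.ceil(math.log2(max_value))

print(integer_bits(423))         # 9, covering a range of 0 to 511
print(16 - integer_bits(423))    # 7 fractional bits remain in a 16-bit word
print(2 ** -7)                   # 0.0078125, the resulting resolution
```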

You can increase the accuracy of a fixed-point number by using more bits to store the fractional part. When designing, there are times when you may wish to store only fractional numbers (a 0,16 format). Depending upon the size of the number, scaling up by 2^16 may still not provide an accurate enough result. In this case you can multiply by a larger power of 2, chosen such that the scaled number can still be represented within a 16-bit number. You can then remove this scaling at a later stage within the implementation. For example, to represent the number 1.45309806319×10^-4 in a 16-bit number, the first step is to multiply it by 2^16.

65536 • 1.45309806319×10^-4 = 9.523023

Storing the integer part of the result (9) means the number is actually stored as 1.37329101563×10^-4 (9 / 65536). The difference between the number required and the number stored is substantial and could lead to an unacceptable error in the calculated result. You can obtain a more accurate result by scaling the number up by a larger power of 2, chosen so that the result lies between 32768 and 65535 and therefore still fits in a 16-bit number. Using the earlier example of storing 1.45309806319×10^-4, multiplying by a factor of 2^28 yields a number that can be stored in 16 bits and closely approximates the desired value.

268435456 • 1.45309806319×10^-4 = 39006.3041205

The integer part of the result gives a stored number of 1.45308673382×10^-4, which provides for a much more accurate calculation, assuming you can address the scaling factor of 2^28 at a later stage within the calculation. For example, multiplying the scaled number with a 16-bit number scaled 4,12 will produce a result with 40 fractional bits (28 + 12), stored in a 32-bit result.
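The effect of the two scaling factors can be reproduced in a few lines; this Python sketch mirrors the truncating "store the integer part" step described above:

```python
target = 1.45309806319e-4

# Scale by 2^16: the integer part is only 9, so most of the precision is lost
s16 = int(65536 * target)        # 9
print(s16 / 65536)               # ~1.3733e-4, a large relative error

# Scale by 2^28 instead: the integer still fits in 16 bits (32768..65535)
s28 = int((2 ** 28) * target)    # 39006
print(s28 / 2 ** 28)             # ~1.453087e-4, much closer to the target
assert 32768 <= s28 <= 65535     # still storable in a 16-bit register
```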

Fixed-point rules
To add, subtract or divide, the decimal points of both numbers must be aligned. That is, you can only add to, subtract from or divide an x,8 number by a number that is also in an x,8 representation. To perform arithmetic operations on numbers of a different x,y format, you must first ensure the decimal points are aligned. To align a number to a different format, you have two choices: either multiply the number with more integer bits by 2^x, or divide the number with fewer integer bits by 2^x. Division, however, will reduce your accuracy and may lead to a result that is outside the allowable tolerance. Since all numbers are stored with base-2 scaling, you can easily scale a number up or down in an FPGA by shifting one place to the left or right for each power of 2 required to balance the two decimal points. To add together two numbers that are scaled 8,8 and 9,7, you can either scale up the 9,7 number by a factor of 2^1 or scale the 8,8 format down to a 9,7 format, if the loss of a least-significant bit is acceptable.

For example, say you want to add 234.58 and 312.732, which are stored in an 8,8 and a 9,7 format respectively. The first step is to determine the actual 16-bit numbers that will be added together.

234.58 • 2^8 = 60052.48
312.732 • 2^7 = 40029.69

The two numbers to be added are 60052 and 40029. However, before you can add them you must align the decimal points. To align the decimal points by scaling up the number with the larger number of integer bits, you must scale up the 9,7-format number by a factor of 2^1.

40029 • 2^1 = 80058

You can then calculate the result by performing an addition of:

80058 + 60052 = 140110

This represents 547.3046875 in a 10,8 format (140110 / 2^8).
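The worked example can be replayed directly; in this Python sketch the left shift stands in for the scaling the FPGA fabric would perform:

```python
# 234.58 stored in 8,8 and 312.732 stored in 9,7
a = int(234.58 * 2 ** 8)    # 60052
b = int(312.732 * 2 ** 7)   # 40029

# Align the 9,7 number to 8 fractional bits by shifting left once (x 2^1)
total = a + (b << 1)        # 140110
print(total / 2 ** 8)       # 547.3046875, the sum in a 10,8 format
```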

When multiplying two numbers together, you do not need to align the decimal points, as the multiplication will provide a result that is X1 + X2, Y1 + Y2 wide. Multiplying two numbers that are formatted 14,2 and 10,6 will produce a result that is formatted as 24 integer bits and 8 fractional bits.
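A quick check of this rule, using hypothetical values chosen to fit the 14,2 and 10,6 formats exactly:

```python
x = int(3.25 * 2 ** 2)   # 13, i.e. 3.25 in a 14,2 format
y = int(1.5 * 2 ** 6)    # 96, i.e. 1.5 in a 10,6 format
product = x * y          # no alignment needed before multiplying
# The product carries 2 + 6 = 8 fractional bits (a 24,8 format)
print(product / 2 ** 8)  # 4.875, which is exactly 3.25 * 1.5
```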

You can avoid division within an equation by instead multiplying by the reciprocal of the divisor. This approach can reduce the complexity of your design significantly. For example, to divide the number 312.732, represented in a 9,7 format (40029), by 15, the first stage is to calculate the reciprocal of the divisor.

1 / 15 = 0.06666

This reciprocal must then be scaled up, to be represented within a 16-bit number.

65536 • 0.06666 = 4369

This step will produce a result that is formatted 9,23 when the two numbers are multiplied together.

4369 • 40029 = 174886701

The result of this multiplication is thus:

174886701 / 2^23 = 20.8481

While the expected result is 20.8488, if the result is not accurate enough, then you can scale up the reciprocal by a larger factor to produce a more accurate result. Therefore, never divide by a number when you can multiply by the reciprocal.
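The reciprocal-multiply sequence can be sketched end to end (Python as a checker for the arithmetic above):

```python
dividend = int(312.732 * 2 ** 7)   # 40029, the 9,7 input
recip = int((1 / 15) * 2 ** 16)    # 4369, the reciprocal scaled into 16 bits
product = dividend * recip         # 174886701, formatted 9,23
print(product / 2 ** 23)           # ~20.848, against the exact 20.8488
```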

Issues of overflow
When implementing algorithms, the result must not be larger than what is capable of being stored within the result register. Otherwise a condition known as overflow occurs. When that happens, the stored result will be incorrect and the most significant bits are lost. A very simple example of overflow would be if you added two 16-bit numbers, each with a value of 65535, and the result was stored within a 16-bit register.

65535 + 65535 = 131070

The above calculation would result in the 16-bit result register containing a value of 65534, which is incorrect. The simplest way to prevent overflow is to determine the maximum value that can result from the mathematical operation, and use the following equation to determine the size of the result register required:

Register size (bits) = ceil(log2(maximum result))

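A two-line model shows the wraparound; in this Python sketch the mask plays the role of the 16-bit result register:

```python
import math

MASK = 0xFFFF                 # a 16-bit result register
true_sum = 65535 + 65535      # 131070, the true result
stored = true_sum & MASK      # what actually lands in the register
print(stored)                 # 65534: the most significant bit has been lost
print(math.ceil(math.log2(true_sum)))   # 17: bits needed to hold the true sum
```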
If you were developing an averager to calculate the average of up to fifty 16-bit inputs, the size of the required result register could be calculated.

50 • 65535 = 3276750

Applying the same equation, ceil(log2(3276750)) = 22, so a 22-bit result register is required to prevent overflow occurring. You must also take care, when working with signed numbers, to ensure there is no overflow when using negative numbers. Using the averager example again, suppose averaging ten signed 16-bit numbers must return a 16-bit result.

10 • -32768 = -327680

Since it is easier to multiply the result by a scaled reciprocal of the divisor, you can multiply this number by: 1/10 • 65536 = 6554 to determine the average.

-327680 • 6554 = -2147614720

This number, when divided by 2^16, equals -32770, which cannot be represented correctly within a signed 16-bit output. The module design must therefore take this overflow into account and detect it, to ensure you don't output an incorrect result.
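The signed averager overflow can likewise be modeled; in this Python sketch the arithmetic right shift stands in for discarding the 2^16 reciprocal scaling:

```python
total = 10 * -32768               # worst-case sum of ten samples: -327680
recip = 6554                      # 1/10 scaled by 2^16
average = (total * recip) >> 16   # -32770
print(average)
# -32770 lies outside the signed 16-bit range of -32768..32767,
# so the hardware must detect and flag this overflow
```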

Real-world implementation
Let's say that you are designing a module to implement a transfer function that is used to convert atmospheric pressure, measured in millibars, into altitude, measured in meters.

-0.0088x² + 1.7673x + 131.29

The input value will range between 0 and 10 millibars, with a resolution of 0.1 millibar. The output of the module is required to be accurate to +/-0.01 meters. As the module specification does not determine the input scaling, you can figure it out from the following equation:

Integer bits = ceil(log2(10)) = 4

Therefore, to ensure maximum accuracy you should format the input data as 4 integer and 12 fractional bits. The next step in the development of the module is to use a spreadsheet to calculate the expected result of the transfer function across the entire input range using the unscaled values. If the input range is too large to reasonably achieve this, then calculate an acceptable number of points. For this example, you can use 100 entries to determine the expected result across the entire input range.

Once you have calculated the initial unscaled expected values, the next step is to determine the correct scaling factors for the constants and calculate the expected outputs using the scaled values. To ensure maximum accuracy, each of the constants used within the equation will be scaled by a different factor.

The scaling factor for the first polynomial constant (A) is given by:

A requires ceil(log2(131.29)) = 8 integer bits, so it is scaled by 2^(16-8) = 2^8.

The second polynomial constant (B) scaling factor is given by:

B requires a single integer bit, so it is scaled by 2^(16-1) = 2^15.
The final polynomial constant (C) can be scaled up by a factor of 2^16, as it is completely fractional.

These scaling factors allow you to calculate the scaled spreadsheet, as shown in Table 1. The results of each stage of the calculation will produce a result that will require more than 16 bits.

Table 1. Real results against the fixed-point mathematics

The calculation of x² will produce a result that is 32 bits long, formatted 4,12 + 4,12 = 8,24. This is then multiplied by the constant C, producing a result that will be 48 bits long, formatted 8,24 + 0,16 = 8,40. For the accuracy required in this example, 40 bits of fractional representation is excessive. Therefore, the result is divided by 2^32 to produce a 16-bit result formatted 8,8. The same reduction to 16 bits is carried out upon the calculation of Bx to produce a result formatted 5,11.

The result is the addition of columns Cx², Bx and A. However, to obtain the correct result you must first align the radix points, either by shifting up A and Cx² to align the numbers in an x,11 format, or by shifting down the calculated Bx to a format of 8,8, aligning the radix points with the calculated values of A and Cx².

In this example, we shifted down the calculated value by 2^3 to align the radix points in an 8,8 format. This approach reduced the number of shifts required, and hence the logic needed to implement the example. Note that if you cannot achieve the required accuracy by shifting down to align the radix points, then you must instead align them by shifting up the calculated values of A and Cx². In this example, the calculated result is scaled up by a power of 2^8. You can then scale down the result and compare it against the result obtained with unscaled values. The difference between the calculated and expected results gives the error; the spreadsheet functions MAX() and MIN() then give the maximum and minimum error of the calculated result across the entire range of spreadsheet entries.

Once the calculated spreadsheet confirms that you can achieve the required accuracy, you can write and simulate the RTL code. If desired, you could design the testbench such that the input values are the same as those used in the spreadsheet. This allows you to compare the simulation outputs against the spreadsheet-calculated results to ensure the correct RTL implementation.
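As a stand-in for the spreadsheet, the scaled pipeline can be modeled bit-accurately in software. The sketch below is a Python approximation, assuming the scalings derived above (input 4,12; A scaled by 2^8, B by 2^15, C by 2^16) and the shift positions described for the RTL; it compares the fixed-point result against the floating-point polynomial across the input range:

```python
A = round(131.29 * 2 ** 8)     # 33610, format 8,8
B = round(1.7673 * 2 ** 15)    # 57915, format 1,15
C = round(-0.0088 * 2 ** 16)   # -577, format 0,16

def fixed_transfer(x):
    xi = round(x * 2 ** 12)          # input quantized to 4,12
    x2 = xi * xi                     # 8,24
    cx2 = (C * x2) >> 32             # 8,40 shifted down to 8,8
    bx = (B * xi) >> 19              # 5,27 shifted down to x,8
    return (A + cx2 + bx) / 2 ** 8   # final 8,8 result, rescaled for display

def float_transfer(x):
    return -0.0088 * x * x + 1.7673 * x + 131.29

# Worst-case model error over the 0..10 millibar range in 0.1 steps
err = max(abs(fixed_transfer(i / 10) - float_transfer(i / 10))
          for i in range(101))
print(err)   # judge this against the +/-0.01 m accuracy budget
```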

RTL implementation
The RTL example uses signed parallel mathematics to calculate the result within four clock cycles. Because of the signed parallel multiplication, you must take care to correctly handle the extra sign bits generated by the multiplications.

ENTITY transfer_function IS
  PORT(
    sys_clk  : IN  std_logic;
    reset    : IN  std_logic;
    data     : IN  std_logic_vector(15 DOWNTO 0);
    new_data : IN  std_logic;
    result   : OUT std_logic_vector(15 DOWNTO 0);
    new_res  : OUT std_logic);
END ENTITY transfer_function;

ARCHITECTURE rtl OF transfer_function IS
  -- this module performs the transfer function -0.0088x² + 1.7673x + 131.29
  -- input data is scaled 4,12, while the output data will be scaled 8,8.
  -- this module utilizes signed parallel mathematics

  TYPE control_state IS (idle, multiply, add, result_op);
  CONSTANT c : signed(16 DOWNTO 0) := to_signed(-577, 17);
  CONSTANT b : signed(16 DOWNTO 0) := to_signed(57910, 17);
  CONSTANT a : signed(16 DOWNTO 0) := to_signed(33610, 17);
  SIGNAL current_state : control_state;
  SIGNAL buf_data : std_logic;           -- used to detect a rising edge upon new_data
  SIGNAL squared  : signed(33 DOWNTO 0); -- register holding the input squared
  SIGNAL cx2      : signed(50 DOWNTO 0); -- register used to hold Cx²
  SIGNAL bx       : signed(33 DOWNTO 0); -- register used to hold Bx
  SIGNAL res_int  : signed(16 DOWNTO 0); -- register holding the temporary result

BEGIN
  fsm : PROCESS(reset, sys_clk)
  BEGIN
    IF reset = '1' THEN
      buf_data      <= '0';
      squared       <= (OTHERS => '0');
      cx2           <= (OTHERS => '0');
      bx            <= (OTHERS => '0');
      result        <= (OTHERS => '0');
      res_int       <= (OTHERS => '0');
      new_res       <= '0';
      current_state <= idle;
    ELSIF rising_edge(sys_clk) THEN
      buf_data <= new_data;
      CASE current_state IS
        WHEN idle =>
          new_res <= '0';
          IF (new_data = '1') AND (buf_data = '0') THEN  -- detect rising edge of new_data
            squared       <= signed('0' & data) * signed('0' & data);
            current_state <= multiply;
          ELSE
            squared       <= (OTHERS => '0');
            current_state <= idle;
          END IF;
        WHEN multiply =>
          new_res       <= '0';
          cx2           <= squared * c;
          bx            <= signed('0' & data) * b;
          current_state <= add;
        WHEN add =>
          new_res       <= '0';
          res_int       <= a + cx2(48 DOWNTO 32) + ("000" & bx(32 DOWNTO 19));
          current_state <= result_op;
        WHEN result_op =>
          result        <= std_logic_vector(res_int(res_int'high - 1 DOWNTO 0));
          new_res       <= '1';  -- flag that a new result is available
          current_state <= idle;
      END CASE;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;

The architecture of FPGAs makes them ideal for implementing mathematical functions, although the implementation of your algorithm may take a little more initial thought and modeling in system-level tools such as MATLAB or Excel. You can quickly implement mathematical algorithms once you have mastered some of the basics of FPGA mathematics.

About the author
Adam Taylor is Principal Engineer at EADS Astrium. Adam can be contacted at aptaylor@theiet.org


Paul A. Clayton

8/7/2012 5:58 PM EDT

It would slightly simplify your equations for determining the required number of bits to simply use 'log'--for the quotient of two logs, the base does not matter. (I also thought the base was presented as a subscript rather than a superscript, but that may be a cultural difference.)

It seems that a high result multiplier would be more desirable than a full precision (doubled precision result) or low result multiplier. I.e., one tends to care about the most significant bits. (Since a normalized FP multiply uses the high result, I would guess that FPGAs support such.)

Also with multiplication (and division) shifts can be done before the operation or after the operation as long as the multiplier will not lose necessary precision. (When one operand is a constant, this can allow one to avoid shifting.)


Paul A. Clayton

8/7/2012 6:07 PM EDT

Sorry, some of that was already mentioned--I should have read more carefully!


Ad_Ceng

8/8/2012 2:53 AM EDT

No problem, it is hard to determine what to include in these articles. Regarding the log10: the equations will of course work in any base, but most calculators provide log10 and ln, hence my use of log10. Any base will do, as you say. Thanks for your comments.


Paul A. Clayton

8/7/2012 6:18 PM EDT

I am guessing that your polynomial constant A is 131.29 (as in the original formula and in the scaled version in the table) not 133.29 (as in the table and the scaling factor calculation).

(By the way, using a superscript for squaring would seem to be clearer than using x2. I am guessing that this was a conversion to html issue.)


Max the Magnificent

8/7/2012 6:24 PM EDT

It's 11:30pm in the UK as I pen these words -- I'm sure Adam will respond in the morning -- Max


Ad_Ceng

8/8/2012 2:50 AM EDT

Sorry gents, you are correct, it should be 131.29. My apologies.


George Constantinides

8/8/2012 5:48 AM EDT

It is good to see these issues being discussed in this forum - thank you.

A couple of issues for clarity:

1. You don't need more bits to represent a larger number in fixed point, as there is no reason to require the unit digit to be part of the bit representation, e.g. we could have "01" representing 1x2^8 if we like. Equally it could represent 1x2^{-8}. So long as we know the scaling, it doesn't matter - in effect - whether the binary point is inside or outside the number. Thus the number of bits defines the dynamic range, but not the range of representable numbers. This is not quite captured by the notion of "integer bits" used here.

2. To add and subtract, you must align scalings of fixed point arguments, as you say. But you don't need to do this for division.

Readers may be interested in the latest developments on this and related issues in IEEE Design & Test magazine: "Numerical Data Representations for FPGA-based Scientific Computing", G.A. Constantinides, N. Nicolici, A.B. Kinsman, IEEE Design and Test 28(4).


Ad_Ceng

8/8/2012 1:33 PM EDT

George
Thanks for the kind comments.

With respect to point one I did cover storing different scaling factors in a vector as opposed to the actual width. The key becomes can you accurately represent the number in the vector width available i.e. dynamic range as you correctly point out.

I made the point about aligning the numbers for division because, while it is possible to divide non-aligned numbers, the scaling of the result will be the difference between the two, and you have to be careful not to send them negative. As this is a basic how-to article I did not want to introduce too many concepts. I will address this in my blog over at Programmable Planet, however, as it is an important concept.

Thanks again for taking the time to read it I do appreciate it ;)


larsen

8/9/2012 4:20 PM EDT

Thanks for a good article. Your warning about overflow producing an incorrect result, however, is not a concern in all cases, and this has important practical implications. I think it is less well known and quite astonishing. It goes:

"You can add any quantity of fixed point signed numbers (say W bits wide), in _any_ order and ignore overflow - PROVIDED that the final result is within the range of the accumulator. The result will always be correct!"

Example: W=3 bits and, for simplicity, no fractional bits, so numbers can be [-4, -3, -2, -1, 0, 1, 2, 3].

We all agree on the following example calculation using decimal numbers:
-4-3+2+3 = -2
Now do the summing from left to right using only 3 bits in the accumulator (Bxxx is binary):
-4-3 = -7 (B100+B101 = B1001, overflow! Remove the excess bit) Result = B001 = 1,
1+2 = 3 (B001+B010 = B011)
3+3 = 6 (B011+B011 = B110), but B110 is the same as -2,
i.e. the result we were looking for.

All this is due to the modulus arithmetic in operation.

Be cautious though. This does NOT work if you - in a mistaken attempt to be cautious and careful to catch errors - implement the adder with saturation! The result will be totally wrong. So in an FIR (Finite Impulse Response filter) for instance where such a long sum is bread-and-butter, one should _not_ use a saturating adder but simply truncate the overflowing bits.

By the way, in this example you could do with just 2 bits in the accumulator (and each number, for that matter) because the result -2 can be represented in 2 bits. Only the final result determines the size required by the accumulator; excess bits can be ignored.
Henning E. Larsen


Frank Eory

8/9/2012 6:01 PM EDT

Excellent point about not using saturation arithmetic in FIR filters, and just allowing modulo arithmetic to do its thing.


I_B_GREEN

8/12/2012 4:48 PM EDT

Yes, but when writing code, how do you know the outcome? Don't you have to assume the worst case?


Ad_Ceng

8/13/2012 3:14 PM EDT

Whether writing code or designing hardware, one should always ensure it works in the worst case. With code you need to ensure you hit all of the corner cases. You can model it in Excel, MATLAB or Mathcad, etc., to ensure the worst case is captured.


larsen

8/13/2012 5:56 PM EDT

If you do financial calculations, for instance, then better design for worst case. But when it comes to FIR filters for signal processing, designing for the absolute worst case with respect to maximum amplitude at the input is not always necessary. To illustrate: take a narrow-band FIR filter (f0). The maximum output amplitude will occur with a sine wave at the centre frequency f0. But if you know that such an input signal will never be present in practice, then you can relax the maximum supported amplitude and discard some MSBs, as illustrated in my small numeric example with modulo arithmetic. A realistic example of such an input could be if the sine wave at f0 is always accompanied by other signals - noise, for example. Then you know that the input sine wave can never fill the whole input amplitude, because if it did, there would be clipping or overrun already at the input. The out-of-band signals would be filtered away in the FIR filter and give little contribution to the output - only the sine wave at f0 gets through. I.e., the net result is that you can spare some of the MSBs without damaging your signal. But again, using saturation-type adders in the intermediate results would cause distortion at a lower input level than without saturation.
To determine if the filter is properly designed you would need to simulate or at least be prepared to do some iterations of test and redesign.

