FPGA Synthesis of Custom DSP Blocks Using Distributed Arithmetic
Post a comment
ABOUT THE AUTHORS
M.
MartinezPeiró, R. Gadea, R. Colom, F. Ballester, and V.
Herrero are teachers on Digital Applications at the Technical
University of Valencia (Spain). They are currently working on
video, custom DSP, HDL simulators, digital filtering, and image
lifting.


The high density of currently available FPGAs allows designers to develop singlechip FPGAbased DSP blocks. The combination of DSP technology and FPGA implementation is sometimes known as a Custom Digital Signal Processor (CDSP). A CDSP lets you select the precision, inner data width, and word length for a particular application. In this paper, we explore CDSP synthesis using Distributed Arithmetic (DA). Initial studies of DA started thirty years ago. Peled and Liu, Croisier et al. , and Zohar enunciated the theoretical basis of DA and how to use it to design digital filters. Several DA reviews of DA are available. This type of arithmetic takes advantage of technologies rich in memory elements such as RAMbased FPGAs. In these devices, designers can use lookuptables and adders to compute the inner products of a mathematical computation.
This paper describes a methodology to implement FPGAbased FIR filters. The authors use a VHDL description to design a parameterized, fullparallel, FIR filter macromodel and have validated the design by compiling several versions in Altera 10K503 FPGAs. The use of DA for FIR filter design is reviewed and the basic cells used to build the macromodel is described. In addition, the results of the FPGA implementation of the FIR filter are shown.
where A_{t} are the fixed coefficients, T is the number of filtertaps, and x_{t} are the input data words (Equation 2).
where N is the bitnumber of the data, x_{t0} the sign bit, and x_{t,n} is bit "n" of x_{t}.
You can rewrite Equation 1 as:
which suggest the base of the distributed arithmetic. You can precalculate the following terms in Equations 4 and 5. Each x_{t,n} takes the value '0' or '1' and _{n} can take one of the 2^{(N1)} possible values, all of which are possible sums of coefficients A_{t}.
Equations 4 and 5 produce a new expression shown as Equation 6:
You can expand Equation 6 into individual product terms where x_{1,0} is the bit 0 of the first sample, x_{2,0} is the bit 0 of the second sample, and so on.
Each term in brackets in Equation 7 is stored in ROM, RAM, or a LookUp Table (LUT). The 2^{(N1)} different values represented by these terms are addressed by x_{t,n}. Table 1 shows how the term storage is accomplished.
Address  
x_{t,N1}  ...  x_{t,2}  x_{t,1}  x_{t,0}  Content 
0  ...  0  0  0  0 
0  ...  0  0  1  A_{1} 
0  ...  0  1  0  A_{2} 
0  ...  0  1  1  A_{2} + A_{1} 
1  ...  1  1  1  A_{T}... + A_{2} + A_{1} 
Table 1: A ROM, RAM, or LUT stores distributed arithmetic precalculated terms
By accumulating the N values of _{n} with appropriate weighting factors, you can implement digital filters using distributed arithmetic by using registers, memory elements, and a scaling accumulator. Figure 1 represents a serial implementation of a digital FIR filter based on the expression of Equation 6.
Figure 1: A fullserial FIR filter based on distributed arithmetic
The scaling accumulator adds each _{n} between _{1} and _{n1}, and subtracts the _{0} term, which corresponds to the MSB of the input samples. You can also implement his structure in a mixed format (Figure 2), or in a fullparallel fashion (Figure 3).
Figure 2: Mixedformat version of a DAbased FIR filter
Figure 3: Fullparallel version of a DAbased FIR filter
The main advantage of FIR filters is their linearphase response. These filters achieve this response with symmetrical or antisymmetrical structure, which also offer hardware reductions. You can therefore express Equation 7 as:
Equation 8 reduces the number of multiplications to T/2 or (T1)/2 for even or odd filters respectively. If you need an antisymmetrical FIR filter, change the addition in the parenthesis to a subtraction.
For example, for a 16tap FIR filter (T=16) the number of storage elements you need for a direct implementation of _{n} is 2^{8}. If the sum is divided by K=2, needing ^{4}_{n} sums (^{8}_{n}=^{4}_{n}+^{4}_{n}), the storage you need becomes 2^{4}+2^{4} = 2^{5} words.
Number of Taps  Number of _{n}  Number of Taps  Number of _{n} 
7, 8  ^{4}_{n}  17, 18  2^{4}_{n}+^{1}_{n} 
9, 10  ^{4}_{n}+^{1}_{n}  21, 22  2^{4}_{n}+^{3}_{n} 
11, 12  ^{4}_{n}+^{2}_{n}  43, 44  5^{4}_{n}+^{2}_{n} 
15, 16  2^{4}_{n}  63, 64  8^{4}_{n} 
Table 2: Distributed Arithmetic Blocks reduce FPGA resource requirements for implementing filter functions
This algorithm can be easily computed in a recursive manner:
for i in 4 down to 1
(T/2) mod i = n^{umber of}(^{i}_{n}) ;
(T/2)=(T/2)i* n^{umber of}(^{i}_{n}) ;
where mod is the arithmetic operator "module" and n^{umber of}(^{i}_{n} is the number of type ^{i}_{n} blocks.
Figure 4: EightTap LinearPhase FIR filter
Figure 4 shows a structure of an eighttap linearphase FIR filter. You delete the shadowed register and adder for an odd FIR filter. To design filters with a number of taps higher than eight, for example a 12tap FIR filter, the structure comprises two blocks. One block represents an eighttap filter (with only fourinput LUTs) and the other a threetap filter (with only twoinput LUTs).
Figure 5: 12tap FIR filter pipeline implementation
Figure 6:43tap FIR filter
Figure 7:81tap FIR filter
Pipeline and Parameter Description
Pipeline implementation increases the speed of the filter model.
However, FPGA speed is limited by the device's routing delays. The
pipeline has been done by registering each cell, shown by the dark
lines in Figure 5. This option does not increment the size
of functional area, since each logic cell is composed of a LUT and
a register. In the pipeline versions, the latency you get is
5+log_{2}(number of block), where 5 is the latency of the
block.
You can use VHDL to describe each of the above blocks in a parameterized macromodule. With the model, you can vary several FIRfilter parameters including the number of taps, the kind of structure (symmetrical or antisymmetrical) and the use of pipelining. The filter implementation, based on company recommendations, uses Altera's lpm_add_sub function.
Figure 8: Maximum frequency of the DA FIR filters
Figure 8 shows the speed of both pipelined and nonpipelined versions of the filters. Note that the pipelinedcircuit frequencies are practically constant, close to 40MHz, even with increasing number of filter taps. It effect is caused by the path between two pipelined registers, that can reach a high interconnection delay. In nonpipelined filters, the maximum frequency is similar for all versions, around 14MHz, and speed is essentially tapindependent.
Figure 9: Occupation of the DA FIR filters
Figure 9 shows the occupation for both pipelined and nonpipelined filters. The horizontal lines represent the maximum capacity of each chip in the FLEX10K family (from 10K10 to 10K50). The graph shows that you can fit up to a 13tap FIR filter (nonpipelined) in a EFP10K10. Extrapolating the graph show that you can implement up to 45tap FIR filter in an EPF10K50.
We have described a methodology to construct highspeed FIR filters in Altera FPGAs. You can use this methodology to design symmetric or antisymmetric filter structures, and can also include pipelining. Use of distributed arithmetic contributes filters' highspeed. We have also built a parameterized macromodel using synthesizable VHDL. You can use the resulting filters in realtime applications up to 40 MHz. This upperfrequency is essentially independent of the number of filter taps. Results also show that you can implement a 45tap FIR filter in an Altera EPF10K50 FPGA.