, R. Gadea, R. Colom, F. Ballester, and V.
Herrero are teachers on Digital Applications at the Technical
University of Valencia (Spain). They are currently working on
video, custom DSP, HDL simulators, digital filtering, and image
The high density of currently available
FPGAs allows designers to develop single-chip FPGA-based DSP
blocks. The combination of DSP technology and FPGA
implementation is sometimes known as a Custom Digital Signal
Processor (CDSP). A CDSP lets you select the precision, inner data
width, and word length for a particular application. In this paper,
we explore CDSP synthesis using Distributed Arithmetic (DA).
Initial studies of DA started thirty years ago. Peled and Liu, Croisier et al. , and Zohar enunciated the theoretical basis of DA and how to
use it to design digital filters. Several DA reviews of DA are
available. This type of arithmetic takes advantage of
technologies rich in memory elements such as RAM-based FPGAs. In
these devices, designers can use lookup-tables and adders to
compute the inner products of a mathematical computation.
This paper describes a methodology to implement FPGA-based FIR
filters. The authors use a VHDL description to design a
parameterized, full-parallel, FIR filter macro-model and have
validated the design by compiling several versions in Altera
10K50-3 FPGAs. The use of DA for FIR filter design is reviewed and
the basic cells used to build the macro-model is described. In
addition, the results of the FPGA implementation of the FIR filter
In FIR filtering, you compute inner
products with the expression:
where At are the fixed coefficients, T is the number
of filter-taps, and xt are the input data words
where N is the bit-number of the data, xt0 the sign
bit, and xt,n is bit "n" of xt.
You can rewrite Equation 1 as:
which suggest the base of the distributed arithmetic. You can
pre-calculate the following terms in Equations 4 and
5. Each xt,n takes the value '0' or '1' and n can take one of the
2(N-1) possible values, all of which are possible sums
of coefficients At.
Equations 4 and 5 produce a new expression shown
as Equation 6:
You can expand Equation 6 into individual product terms
where x1,0 is the bit 0 of the first sample,
x2,0 is the bit 0 of the second sample, and so on.
Each term in brackets in Equation 7 is stored in ROM,
RAM, or a Look-Up Table (LUT). The 2(N-1) different
values represented by these terms are addressed by xt,n.
Table 1 shows how the term storage is accomplished.
||A2 + A1
||AT... + A2 +
Table 1: A ROM, RAM, or LUT stores distributed
arithmetic pre-calculated terms
By accumulating the N values of n with appropriate weighting
factors, you can implement digital filters using distributed
arithmetic by using registers, memory elements, and a scaling
accumulator. Figure 1 represents a serial implementation of a digital
FIR filter based on the expression of Equation 6.
Figure 1: A full-serial FIR filter based on
The scaling accumulator adds each n between 1 and n-1, and subtracts the 0 term, which corresponds to the MSB
of the input samples. You can also implement his structure in a
mixed format (Figure 2), or in a full-parallel fashion
Figure 2: Mixed-format version of a DA-based FIR filter
Figure 3: Full-parallel
version of a DA-based FIR filter
The main advantage of FIR filters is their linear-phase
response. These filters achieve this response with symmetrical or
anti-symmetrical structure, which also offer hardware reductions.
You can therefore express Equation 7 as:
Equation 8 reduces the number of multiplications to T/2
or (T-1)/2 for even or odd filters respectively. If you need an
anti-symmetrical FIR filter, change the addition in the parenthesis
to a subtraction.
DA-Based Altera-Oriented Design
Each logic cell in Altera FPGAs has a
four-input LUT. With the logic cell, you can perform any Boolean
function of four variables. The most efficient way in terms of area
and speed to implement filters in these FPGAs is to limit the
number of variables of the n
function to four. Note that you
need three four-input LUTs to implement a five-variable function.
Due to this requirement, the partitioning of the sum of T terms
into K sums of four variables reduces required memory to store n
due to symmetry considerations) to:
For example, for a 16-tap FIR filter (T=16) the number of
storage elements you need for a direct implementation of n is 28. If the sum is
divided by K=2, needing 4n sums (8n=4n+4n), the storage you need
becomes 24+24 = 25 words.
Table 2: Distributed Arithmetic Blocks reduce FPGA
resource requirements for implementing filter functions
This algorithm can be easily computed in a recursive manner:
for i in 4 down to 1
(T/2) mod i = number of(in) ;
(T/2)=(T/2)-i* number of(in) ;
where mod is the arithmetic operator "module" and
number of(in is the number of type
Figure 4: Eight-Tap Linear-Phase FIR filter
Figure 4 shows a structure of an eight-tap linear-phase
FIR filter. You delete the shadowed register and adder for an odd
FIR filter. To design filters with a number of taps higher than
eight, for example a 12-tap FIR filter, the structure comprises two
blocks. One block represents an eight-tap filter (with only
four-input LUTs) and the other a three-tap filter (with only
Figure 5: 12-tap FIR filter pipeline
Figure 6:43-tap FIR filter
Figure 7:81-tap FIR filter
Pipeline and Parameter Description
Pipeline implementation increases the speed of the filter model.
However, FPGA speed is limited by the device's routing delays. The
pipeline has been done by registering each cell, shown by the dark
lines in Figure 5. This option does not increment the size
of functional area, since each logic cell is composed of a LUT and
a register. In the pipeline versions, the latency you get is
5+log2(number of block), where 5 is the latency of the
You can use VHDL to describe each of the above blocks in a
parameterized macro-module. With the model, you can vary several
FIR-filter parameters including the number of taps, the kind of
structure (symmetrical or anti-symmetrical) and the use of
pipelining. The filter implementation, based on company
recommendations, uses Altera's lpm_add_sub function.
We have implemented several filters in an
Altera FPGA. The chip chosen for the implementations is the
EPF10K50VRC240-3, which comprises 3184 flip-flops, 2880 Logic
Elements and 20,480 RAM bits. Filter implementation uses default
placement and routing compilation options. Furthermore, the
implementation used carry-chain lines to improve the speed of the
Figure 8: Maximum frequency of the DA FIR
Figure 8 shows the speed of both pipelined and
non-pipelined versions of the filters. Note that the
pipelined-circuit frequencies are practically constant, close to
40MHz, even with increasing number of filter taps. It effect is
caused by the path between two pipelined registers, that can reach
a high interconnection delay. In non-pipelined filters, the maximum
frequency is similar for all versions, around 14MHz, and speed is
Figure 9: Occupation of the DA FIR filters
Figure 9 shows the occupation for both pipelined and
non-pipelined filters. The horizontal lines represent the maximum
capacity of each chip in the FLEX10K family (from 10K10 to 10K50).
The graph shows that you can fit up to a 13-tap FIR filter
(non-pipelined) in a EFP10K10. Extrapolating the graph show that
you can implement up to 45-tap FIR filter in an EPF10K50.
We have described a methodology to construct high-speed FIR
filters in Altera FPGAs. You can use this methodology to design
symmetric or anti-symmetric filter structures, and can also include
pipelining. Use of distributed arithmetic contributes filters'
high-speed. We have also built a parameterized macro-model using
synthesizable VHDL. You can use the resulting filters in real-time
applications up to 40 MHz. This upper-frequency is essentially
independent of the number of filter taps. Results also show that
you can implement a 45-tap FIR filter in an Altera EPF10K50