FPGA-Based FIR Filters Using Distributed Arithmetic
ABOUT THE AUTHORS
M. Martinez-Peiró, R. Colom, F. Ballester, and R. Gadea
are teachers with a focus on digital applications
at the Technical University of Valencia, Spain. They
are currently working on video compression, custom DSP,
HDL simulators, digital filtering, and image lifting.


Cascade and lattice structures have several useful properties, such as low quantization error and high stability in the filter coefficients. Moreover, you can add lattice cells without a full redesign. The goal of this article is to implement FPGA-based direct-form, cascade, and lattice high-order FIR filters using bit-serial distributed arithmetic (DA). The designs are written in an HDL with pipelining techniques and scalable parameters. The next section reviews DA fundamentals and the proposed architecture for each kind of filter. We then present the results of the FPGA implementation, comparing the resulting topologies in both area and speed, and describe DA error models for the three structures.
(1)
(2)
You can precalculate the terms in brackets in Equation 2, store the results in memory, and address them with the bits x_{t,n}, as listed in Table 1. Because each x_{t,n} can take only two values (0 or 1), each product term takes one of 2^{(N-1)} possible values.

Table 1: Distributed Arithmetic precalculated terms
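To make the precalculation concrete, here is a minimal Python sketch of how such a table of terms could be generated. The function name `da_table` and the three coefficient values are hypothetical and chosen only for illustration. The sketch assumes one address bit per coefficient, so the table has 2 raised to the number of address bits entries; the article's own index convention may differ.

```python
from itertools import product


def da_table(coeffs):
    """Return {address bits: precalculated sum} for one DA look-up table.

    Address bit t decides whether coefficient coeffs[t] contributes to the
    stored sum, so a table for len(coeffs) coefficients has 2**len(coeffs)
    entries.
    """
    table = {}
    for bits in product((0, 1), repeat=len(coeffs)):
        table[bits] = sum(h * b for h, b in zip(coeffs, bits))
    return table


# Hypothetical 3-coefficient example (values are illustrative only).
h = [3, -1, 2]
for bits, value in da_table(h).items():
    print(bits, value)
```

Each printed row pairs one combination of input bits with the sum that the memory would hold at that address.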
DA Direct-Form Implementation
A bit-serial DA implementation of an FIR filter addresses one
product term per input bit (the MSB is the sign bit). Each new
product term is added, with the appropriate shift, to the sum of
the product terms accumulated so far.
Figure 1: Bit-serial DA direct-form FIR filter
Figure 1 shows the structure of the direct-form FIR filter. Because the FPGA discussed in this article has four-input LUTs, product terms with more than four inputs must be divided into r parts such that T/r ≤ 4, where T is the number of taps of the filter. The adders in the tree structure then sum the r LUT outputs, and a shift accumulator adds and shifts each product term. Figure 1 represents a bit-serial implementation of a filter with 8-bit samples, so a filter output is available every eight clock cycles. When the sign bit arrives, the shift accumulator performs a subtraction instead of an addition. By using carry-save adders before the LUT, you can implement a symmetrical filter; Croisier and Proakis give detailed information on this operation.
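The data flow described above can be sketched in software. The following Python model is only an illustration under stated assumptions: it uses integer coefficients and two's-complement integer samples, processes the bits LSB first, and the names `build_lut` and `da_dot` are hypothetical. It scales each LUT sum by shifting it left, whereas a hardware shift accumulator would typically shift the running sum right; the arithmetic result is the same up to scaling.

```python
def build_lut(coeffs):
    """Precalculate all 2**len(coeffs) partial sums for one LUT."""
    lut = []
    for addr in range(1 << len(coeffs)):
        lut.append(sum(h for t, h in enumerate(coeffs) if (addr >> t) & 1))
    return lut


def da_dot(coeffs, samples, n_bits=8, lut_inputs=4):
    """Bit-serial DA inner product of coeffs and two's-complement samples."""
    # Split the taps into r groups of at most lut_inputs taps each.
    groups = [range(i, min(i + lut_inputs, len(coeffs)))
              for i in range(0, len(coeffs), lut_inputs)]
    luts = [build_lut([coeffs[t] for t in g]) for g in groups]
    mask = (1 << n_bits) - 1          # two's-complement view of each sample
    acc = 0
    for bit in range(n_bits):         # one clock cycle per bit, LSB first
        partial = 0
        for g, lut in zip(groups, luts):   # adder tree over the r LUT outputs
            addr = 0
            for j, t in enumerate(g):
                addr |= (((samples[t] & mask) >> bit) & 1) << j
            partial += lut[addr]
        if bit == n_bits - 1:
            acc -= partial << bit     # sign bit: subtract instead of add
        else:
            acc += partial << bit
    return acc


# Hypothetical 6-tap example: two LUT groups (4 taps + 2 taps).
coeffs = [1, -2, 3, 4, -5, 6]
window = [10, -20, 30, -40, 50, -60]
print(da_dot(coeffs, window))
```

With six taps and four-input LUTs the model uses r = 2 tables, and one output sample costs eight passes through the loop, mirroring the eight clock cycles per output mentioned above.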
Additionally, you can extend the range of processing speed by pipelining the structure. Equation 3 expresses the operating frequency (f_{s}), where L is the latency and n is the number of bits in each input sample.
(3)
Although the pipelined DA version needs more registers, the total area increases only slightly, because FPGA logic cells pair each LUT with a register and many pipeline registers fit in cells that are already in use.
DA Cascade-Filter Implementation
You can factor a linear-phase FIR filter into several 4th-order
sections to obtain an area reduction. These sections can be
adapted to DA by using the symmetry equation (Equation 4),
which represents the kth section of a T-order filter.
Equation 5 shows the expansion of Equation 4 into DA
product terms; it represents the basic cell of a cascade
structure, which you can design using a bit-serial approach
(Figure 2).
(4)
(5)
Figure 2: Bit-serial DA cascade 4th-order cell
As a result of the latency generated by the pipeline, the cascade cell has extra registers (one per pipeline stage) to synchronize the operation of the filter. Equation 6 gives the real-time operating frequency of the cascade structure.
(6)
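As a rough illustration of the factoring into 4th-order linear-phase sections, the Python sketch below assumes the usual symmetric section form y(n) = c0·(x(n) + x(n-4)) + c1·(x(n-1) + x(n-3)) + c2·x(n-2). This form, the coefficient names c0, c1, c2, and the function names are assumptions; Equation 4 may differ in scaling or notation. The sketch checks that a cascade of two such sections matches the direct-form filter whose coefficients are the product of the two section polynomials.

```python
def section_4th(x, c0, c1, c2):
    """Filter list x with one symmetric 4th-order section.

    y[n] = c0*(x[n] + x[n-4]) + c1*(x[n-1] + x[n-3]) + c2*x[n-2]
    Samples before the start of x are taken as zero.
    """
    def at(i):
        return x[i] if 0 <= i < len(x) else 0

    return [c0 * (at(n) + at(n - 4)) + c1 * (at(n - 1) + at(n - 3))
            + c2 * at(n - 2) for n in range(len(x))]


def cascade(x, sections):
    """Run x through each (c0, c1, c2) section in turn."""
    for c0, c1, c2 in sections:
        x = section_4th(x, c0, c1, c2)
    return x


def convolve(a, b):
    """Polynomial product of two coefficient lists (direct-form equivalent)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


# Hypothetical coefficients, for illustration only.
sections = [(1, 2, 3), (2, -1, 4)]
impulse = [1] + [0] * 8
print(cascade(impulse, sections))
print(convolve([1, 2, 3, 2, 1], [2, -1, 4, -1, 2]))
```

Because the symmetric samples are added before the multiplication, each section needs only three coefficient inputs instead of five, which is what keeps the per-section look-up table small.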
DA Lattice-Filter Implementation
The recursive equations that describe the lattice cell
structure (Equation 7) yield a cascade implementation of
M cells. The f and g terms represent the forward and
backward predictions, respectively, in a linear-predictive
filter structure.
(7)
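For readers who want to experiment, the following Python sketch implements the textbook FIR lattice recursion, f_m(n) = f_{m-1}(n) + k_m·g_{m-1}(n-1) and g_m(n) = k_m·f_{m-1}(n) + g_{m-1}(n-1), with f_0(n) = g_0(n) = x(n). It is an assumption that Equation 7 has exactly this form; the function name `lattice_fir` and the coefficient values are hypothetical.

```python
def lattice_fir(x, k):
    """Filter list x through len(k) FIR lattice cells; return f_M(n)."""
    g_delayed = [0] * len(k)      # holds g_{m-1}(n-1) for each cell
    y = []
    for sample in x:
        f = g = sample            # f_0(n) = g_0(n) = x(n)
        for m, km in enumerate(k):
            g_prev = g_delayed[m]
            g_delayed[m] = g      # store g_{m-1}(n) for the next sample
            # Forward and backward updates use the same old f and g_prev.
            f, g = f + km * g_prev, km * f + g_prev
        y.append(f)
    return y


# Hypothetical two-cell example with k1 = 2, k2 = 3.
print(lattice_fir([1, 0, 0, 0], [2, 3]))
```

Under this recursion the two-cell impulse response is 1, k1·(1 + k2), k2, so adding a cell extends the filter without changing the coefficients of the cells already in place.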
Using the DA equations in Equation 8, we can reproduce the f and g terms with two LUTs, where g' represents the g(n-1) term.
(8)
Figure 3: Bit-serial DA lattice cell
Figure 4: Bit-serial improved lattice cell
Figure 3 represents the formal DA implementation of Equation 8. However, in this article we also propose an improved structure (Figure 4) that reduces the memory requirements of the DA lattice cell. With the structure of Figure 4, you only need to store the coefficient k_{m}. You can get both the k_{m}+1 and 1 values from the carry input of the scaling accumulator.
Figure 5: Operational frequency characteristics of DA bit-serial filter implementations
Figure 6: Area characteristics of DA bit-serial filter implementations
The bit-serial implementation of the lattice structure achieves a real-time operating frequency of 7.5 MHz. In the cascade and direct-form structures, the operating frequency decreases steadily to 4 MHz as the filter order increases.
In DA bit-serial cascade structures (Figure 2), Equation 9 models the error, where e_{m} and e_{a} (shaded red in Figure 2) are the LUT and shift-accumulator rounding errors, respectively.
(9)
Equation 10 expresses the variance of this error, where p_{m} and p_{a} are the number of bits in the memory and in the shift accumulator, respectively.
(10)
Equation 11 represents the direct-form DA error model, where r is the number of data-sample partitions or memory partitions.
(11)
Furthermore, because each four-input LUT can address at most four inputs, the number of memory partitions in the FPGA case is bounded by r ≥ T/4 (T is the order of the filter).
Equation 12 shows the variance of the error in the directform structure.
(12)
Figure 4 shows the improved lattice cell with both error sources e_{m} and e_{a} shaded in red. Equation 13 shows the error and the variance in this cell:
(13)
As an example, we used a Tth-order FIR filter with p = p_{m} = p_{a} = 8 bits to compare the three models. The results for the direct-form, cascade, and lattice implementations are T·4.2384e-07 + 1.6953e-06, T·3.3907e-06, and T·2.5868e-11, respectively. The lattice filter has the lowest error and the cascade form the highest. The direct-form structure also has a high error compared with the lattice-cell structure.
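To get a feel for these figures, the short Python sketch below evaluates the three quoted expressions for a given order, reading them as T·4.2384e-07 + 1.6953e-06, T·3.3907e-06, and T·2.5868e-11. The function name `quoted_error` and the choice T = 40 are illustrative assumptions; the constants are simply the values quoted in the text for 8-bit word lengths.

```python
def quoted_error(T):
    """Evaluate the article's quoted 8-bit error expressions for order T."""
    return {
        "direct-form": T * 4.2384e-07 + 1.6953e-06,
        "cascade": T * 3.3907e-06,
        "lattice": T * 2.5868e-11,
    }


# Illustrative order only; any positive T can be substituted.
results = quoted_error(40)
for name, value in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {value:.4e}")
```

The listing prints the structures from lowest to highest error, which makes the several orders of magnitude between the lattice cell and the other two structures easy to see.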
• We can implement a bit-serial 40th-order lattice filter
in a 10K50 device with a real-time operating frequency of 7.5
MHz. The pipelined cascade and direct-form bit-serial
implementations reach 4.5 MHz for a 60th-order structure in
their symmetrical implementations.
• We have presented a new, improved lattice cell that
reduces memory usage by using the carry input of the
shift accumulator.
• We presented a DA error model showing that the lattice
structure has the lowest rounding error, while the cascade
structure has the highest.