The FIR Filter: A Simple C Implementation
If we implement a 40-sample, 16-tap FIR filter in C without making any attempt to optimize it, the code might look like this:
Figure 1. Simple C implementation of an FIR filter kernel.
On the left side of Figure 1 we show the analytical expression for the FIR filter algorithm, and on the right side we show a simple mapping of the FIR filter to a C implementation. Notice that the mapping produces two nested "for" loops—the inner loop runs over all of the taps (T), and the outer loop runs over all of the input samples (N). (We should also point out that this figure does not show the entire FIR filter implementation; for example, we are not maintaining the delay line here.) Like many (though not all) signal processing algorithms, the FIR filter makes heavy use of multiply-accumulate (MAC) operations.
For this implementation, we're using 16-bit data rather than 32-bit data—although the Cortex-R4 supports both data types, 16 bits is more commonly used in embedded signal processing, and the Cortex-R4's maximum MAC throughput is achieved with 16-bit data.
When we compile this code for the Cortex-R4, the inner loop assembly code looks like this:
(Click to enlarge)
Figure 2. Result of compiling simple C FIR filter implementation on Cortex-R4.
In this case, the compiled code does make use of the dual issue capabilities of the Cortex-R4 in two places in the inner loop. However, our naive C implementation doesn't give the compiler many hints about how to generate an effective assembly implementation, and the compiler doesn't generate code that is as efficient as it could be.
The cycle count for this implementation is 9 cycles per iteration of the inner loop—which is equivalent to 0.11 taps per cycle. In an FIR filter, each tap requires one MAC and two data loads. Given that the Cortex-R4 can perform two MACs per cycle (with an additional cycle or two for data loads) we'd expect much better throughput than what we've achieved here. Therefore, our first step is to look for places in the algorithm where the processor could be doing work more efficiently.
On the Cortex-R4, the condition for a branch can be set by arithmetic operations (such as add and subtract). We have to do some additions to update the loop counters anyway, so this is a good place to modify the C code to help the compiler generate a more efficient implementation.. In other words, we have instructions for updating the loop counters such as ADD r0,r0,#1 for the inner loop. But we can omit the CMP r0, #0x10 instruction if we use a "count down" optimization. In this case, we no longer need to explicitly do a comparison with 0, because the BNE—branch not equal—instruction will automatically take care of that case.
The Cortex-R4's instruction set includes load instructions with pointer updates, but the compiler did not use these instructions. Instead, it used separate instructions for loads and pointer updates. Rewriting the code can guide the compiler to choose the more efficient instructions.
There are several other inefficiencies in the compiler output. For example, the code has a two-cycle stall in the middle of it. That's because loads have two-cycle latencies, and the MAC instruction needs data available in the register one cycle prior to issuing the MAC. Furthermore, although the Cortex-R4 has 64 bits of data bandwidth, the LDRH (load half-word) only uses 16 bits; thus we're using more load/store instructions than we really need. And finally, the compiler did not recognize the opportunity to use SIMD multiplication.
Giving the Compiler a Hand
Rewriting the C implementation of the filter can help the compiler recognize more opportunities for optimization. In Figure 3, we've re-written the C code in a way that's more compiler friendly.
(Click to enlarge)
Figure 3. A more compiler-friendly FIR Filter.
In the new version, we're using the "count down" strategy instead of counting up. And we're using explicit pointer increments to encourage the Cortex-R4 compiler to choose the load instructions with pointer updates. The resulting assembly output is shown in Figure 4.
(Click to enlarge)
Figure 4. Output from "compiler-friendly" FIR filter on Cortex-R4.
This implementation requires 0.17 taps per cycle, which is a lot better than the earlier implementation—1.5X better to be exact. Because of our changes, the compiler chose the "LDRH" instruction that automatically updates the base register pointers (R1 and R2) instead of issuing a separate ADD instruction. And as we'd hoped, the compiler used the BNE instruction for the loop counter and removed the CMP instruction, thus further reducing the cycle count.
But we're still not really taking the Cortex-R4's pipeline into account, nor are we taking advantage of the Cortex-R4's ability to do SIMD operations, which are crucial to the Cortex-R4's FIR filter performance. Furthermore, we are still using only 16 bits of data bandwidth.
Unfortunately, this is about as good as a compiler typically gets. So if we want better performance, we're going to have to write some assembly code.