The first assembly-level change is to rewrite the FIR filter inner loop to use the Cortex-R4's SIMD dual-MAC instruction (SMLAD) and to increase the size of each load from 16 bits to 64 bits (using LDRD, load double-word). Each SMLAD performs two 16-bit MACs, and each LDRD fetches four 16-bit operands, so the loop needs fewer instructions per tap. As mentioned earlier, however, a key limitation of the Cortex-R4's dual-issue capabilities is that it cannot issue a multiply instruction (or a dual-MAC) in parallel with any other instruction. So although we can modify the code to use SIMD, we cannot sustain two MACs per cycle, even with assembly-level optimizations.
The modified assembly code is shown in Figure 5.
Figure 5. Output from "compiler-friendly" FIR filter on Cortex-R4.
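To make the dual-MAC idea concrete, here is a minimal C sketch of what the SMLAD/LDRD inner loop computes. It is a portable model written for illustration, not the assembly shown in the figure: the names smlad_model and fir_dual_mac are hypothetical, the coefficient array is assumed to be stored in the same order as the samples it multiplies, and the tap count is assumed to be a multiple of four.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable model of SMLAD (hypothetical helper name): multiply the two
 * signed 16-bit halves of x and y pairwise and add both products to acc.
 * The sum wraps modulo 2^32. Narrowing to int16_t assumes the usual
 * two's-complement conversion. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t lo = (int32_t)(int16_t)(x & 0xFFFFu) * (int16_t)(y & 0xFFFFu);
    int32_t hi = (int32_t)(int16_t)(x >> 16) * (int16_t)(y >> 16);
    return (int32_t)((uint32_t)acc + (uint32_t)lo + (uint32_t)hi);
}

/* One FIR output as a dot product of ntaps samples and coefficients.
 * Assumption: ntaps is a multiple of 4. */
int32_t fir_dual_mac(const int16_t *x, const int16_t *h, size_t ntaps)
{
    int32_t acc = 0;
    for (size_t k = 0; k < ntaps; k += 4) {
        uint32_t xw[2], hw[2];
        /* Each memcpy models one 64-bit load (LDRD): four 16-bit values. */
        memcpy(xw, &x[k], sizeof xw);
        memcpy(hw, &h[k], sizeof hw);
        /* Two dual-MACs consume the four sample/coefficient pairs. */
        acc = smlad_model(xw[0], hw[0], acc);
        acc = smlad_model(xw[1], hw[1], acc);
    }
    return acc;
}
```

In this model, four taps cost two 64-bit loads and two dual-MACs, compared with eight 16-bit loads and four single MACs in a straightforward scalar loop.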
The resulting performance is 0.5 taps per cycle, which is about a 3X improvement over the improved compiler output. But now more than a third of the cycles in the inner loop are stalls. This happens because both the loads and the MACs have multi-cycle latencies, and the code is not currently arranged in a way that enables the processor to do useful work during those stall cycles. To get rid of the stalls, we'll need to use software pipelining.
Software Pipelining, Algorithmic Transformations
Software pipelining is an optimization technique in which the assembly programmer (or compiler) re-orders instructions to eliminate stalls and allow the processor to do useful work when it would otherwise be idle.
In Figure 6, we show an improved version of the inner loop, using software pipelining to eliminate the stall cycles. Note that this small code snippet uses 12 out of 16 registers available on the Cortex-R4; it's easy to imagine that you could run out of registers pretty quickly on more complex algorithms. Software pipelining increases the throughput to 0.62 taps per cycle—a big improvement, but we can still do better.
Figure 6. Add Software Pipelining
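The structure of a software-pipelined loop can be sketched in C as follows. This is only an illustration of the prologue/kernel/epilogue shape, using a hypothetical function name (fir_pipelined) and scalar 16-bit operands for brevity. A C compiler is free to reschedule these statements, so the ordering shown is not guaranteed to survive compilation; in assembly the programmer controls it directly. The sketch assumes at least one tap and an accumulator that does not overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product written in software-pipelined form: the loads for the
 * next tap are issued before the MAC for the current tap, so the load
 * latency can overlap with useful work. Assumes ntaps >= 1. */
int32_t fir_pipelined(const int16_t *x, const int16_t *h, size_t ntaps)
{
    int32_t acc = 0;

    /* Prologue: fetch the operands for the first tap. */
    int16_t xc = x[0];
    int16_t hc = h[0];

    /* Kernel: load tap k while multiplying tap k-1. */
    for (size_t k = 1; k < ntaps; k++) {
        int16_t xn = x[k];          /* load for the NEXT MAC */
        int16_t hn = h[k];
        acc += (int32_t)xc * hc;    /* MAC on operands loaded earlier */
        xc = xn;
        hc = hn;
    }

    /* Epilogue: finish the tap whose operands are already loaded. */
    acc += (int32_t)xc * hc;
    return acc;
}
```

The extra variables that hold the next iteration's operands are the C analogue of the additional registers that the pipelined assembly loop consumes.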
In Figure 7, we show a well-optimized FIR filter inner loop that uses loop unrolling, the "zipping" optimization (commonly used in FIR filters), and careful instruction scheduling to improve performance. (Here, the register names have been replaced with "x's" because this code is proprietary.) In this version, we've unrolled the outer loop four times and unrolled the inner loop completely. Unrolling the inner loop eliminates its loop overhead, while unrolling the outer loop enables the use of zipping to reduce memory accesses. That is, each of the four outputs computed in a loop iteration shares most of its operands with the other outputs, so we need far fewer loads than in the previous versions of the code.
Figure 7. Fully Optimized FIR Inner Loop for Cortex-R4
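The operand sharing behind zipping can be sketched in C. This is not the proprietary code in the figure; it is a simplified scalar illustration under stated assumptions: the function name fir_zip4 is hypothetical, each output is computed as y[n] = sum over k of h[k] * x[n + k], the number of outputs is a multiple of four, the input holds at least nout + ntaps - 1 samples, and the inner loop is left rolled for brevity rather than fully unrolled.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute four outputs per outer iteration. Each coefficient and each
 * new sample is loaded once and reused by all four accumulators.
 * Assumes nout is a multiple of 4 and x has nout + ntaps - 1 samples. */
void fir_zip4(const int16_t *x, const int16_t *h, size_t ntaps,
              int32_t *y, size_t nout)
{
    for (size_t n = 0; n < nout; n += 4) {
        int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;

        /* Sliding window of samples held in local variables. */
        int16_t s0 = x[n], s1 = x[n + 1], s2 = x[n + 2];

        for (size_t k = 0; k < ntaps; k++) {
            int16_t s3 = x[n + k + 3];  /* one new sample per tap */
            int16_t c  = h[k];          /* one coefficient per tap */

            a0 += (int32_t)c * s0;      /* four MACs share c and */
            a1 += (int32_t)c * s1;      /* the sample window     */
            a2 += (int32_t)c * s2;
            a3 += (int32_t)c * s3;

            s0 = s1;                    /* slide the window */
            s1 = s2;
            s2 = s3;
        }

        y[n]     = a0;
        y[n + 1] = a1;
        y[n + 2] = a2;
        y[n + 3] = a3;
    }
}
```

In this sketch, four outputs need roughly 2 * ntaps + 3 loads, versus 8 * ntaps loads if each output were computed independently with separate sample and coefficient loads for every MAC.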
In this version, we've also scheduled the instructions to avoid stalls between LDRD (load double-word) and SMLAD (dual-MAC) instructions. The resulting code is very similar to what you would see for the single-issue ARM11; there is very little opportunity for Cortex-R4 instructions to dual-issue in this loop because neither the SMLAD instructions nor the LDRD instructions can dual-issue. Nevertheless, this version yields much better FIR filter throughput than what we started with: 0.99 taps per cycle. But of course, this improvement didn't come for free. It took an expert programmer about 20 hours to implement, and it requires many more instructions (and thus more memory) than the simple implementation.
The Cortex-R4 provides much higher signal processing throughput than the ARM9E, but in part because of its deeper pipeline and SIMD capabilities, the Cortex-R4 is also a more challenging target for software optimization. Achieving its maximum performance will require careful optimization at several levels, and programmers will need to trade off code portability and optimization effort against processor performance. As with all processors, the key is to become familiar with all of the instruction variants, pipeline effects, and other architecture details, and to understand the limitations of the compiler.
BDTI provides the industry's most trusted and widely used benchmarks for digital signal processing and video applications. Through benchmarks and analysis, BDTI enables engineers, marketers, and managers to make confident technical and business decisions about technologies for signal processing applications. For more BDTI resources, see www.BDTI.com.