Part 1 explains the DSP features of the Cortex-R4 and shows how the Cortex-R4 stacks up against the competitors.
Applications that involve real-time signal processing often have fairly stringent performance targets in terms of speed, energy efficiency, or memory use. As a result, engineers developing signal processing software often must carefully optimize their code to meet these constraints. Appropriate optimization strategies depend on the metric being optimized (e.g., speed, energy, memory), the target processor architecture, and the specifics of the algorithm.
In our last feature article, we presented signal processing benchmark results for the ARM Cortex-R4. These results were achieved by careful hand-optimization of the Cortex-R4 benchmark assembly code. In this article, we'll share some of the tips and tricks we used to develop our benchmark implementations for the Cortex-R4, starting with algorithm-level optimizations and working our way down to assembly-level optimizations.
Cortex-R4 Instruction Set
As we discussed in the previous article, the Cortex-R4 core implements the ARMv7 instruction set architecture. It uses an eight-stage pipeline and can execute up to two instructions per cycle. The core supports the Thumb-2 compressed instruction set, though most of BDTI's signal processing benchmark code is implemented using standard ARM instructions because of their greater computational power and flexibility. (Signal processing algorithms are typically optimized for maximum speed rather than minimum memory use, though memory usage is often a secondary optimization goal.)
The Cortex-R4 instruction set is fairly simple and straightforward, and most of it will be familiar to engineers who have worked with other ARM cores, particularly the ARM11. Compared to the earlier ARM9E core, however, the Cortex-R4 is noticeably more complex to program because of its superscalar architecture and deeper pipeline (8 stages vs. 5). And, unlike the ARM9E, the Cortex-R4 supports a range of SIMD (single-instruction, multiple-data) instructions. These improve its signal processing performance but often require different algorithms, different data organization, and different optimization strategies than those that worked well with earlier ARM cores.
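To make the data-organization point concrete, here is a minimal portable C sketch of what a dual 16-bit multiply-accumulate does: two 16-bit samples are packed into each 32-bit word, and both products are added to the accumulator in one operation. On ARM cores with these SIMD extensions, the SMLAD instruction performs an operation of this kind. The helper names pack16 and dual_mac16 are hypothetical, chosen for illustration only; this is a model of the arithmetic, not BDTI's benchmark code.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hypothetical helper).
   'lo' occupies bits 15:0 and 'hi' occupies bits 31:16. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Model of a dual 16-bit multiply-accumulate (hypothetical helper):
   acc + x.lo * y.lo + x.hi * y.hi.
   Assumptions: the conversion from uint16_t to int16_t wraps, as it does
   on common two's-complement compilers, and the caller keeps the sum
   within the int32_t range (this sketch does not model overflow). */
static int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(uint16_t)(y >> 16);

    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```

The sketch shows why data layout matters: the operation only pays off when adjacent samples and adjacent coefficients already sit side by side in memory, so that a single 32-bit load delivers both halves of each operand.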
The Cortex-R4 is in some ways similar to the ARM11, which supports a similar range of SIMD operations and also has an eight-stage pipeline. One difference between the two cores is that the Cortex-R4 is a dual-issue superscalar machine while the ARM11 is a single-issue machine. In some cases, this will mean that different optimization strategies are needed to ensure that instructions dual-issue as often as possible. But in many tight inner loops, the two cores may end up using very similar code. This is because of a key limitation on the Cortex-R4's dual-issue capabilities: it cannot execute multiply-accumulate (MAC) operations in parallel with a load, and it cannot use its maximum load bandwidth (64 bits) in parallel with any other operation. As a result, in signal processing inner loops that require maximum MAC throughput or maximum memory bandwidth, the Cortex-R4 is often limited to executing a single instruction at a time.
The benchmark results we presented in our earlier article are the result of careful hand-optimization of assembly code. But rather than diving right into assembly-level optimization, we will take a hierarchical, top-down approach: We will start with a simple C implementation of an FIR filter, then create compiler-friendly code, then evaluate whether (and where) assembly-level optimizations are needed, and finally optimize the assembly code.
In this article, we'll describe some of the high-level and assembly-level optimization techniques we've found to be successful on the Cortex-R4. We will use an FIR filter as an illustrative example since it's a common and familiar signal processing algorithm and is amenable to a number of optimization strategies on the Cortex-R4. The optimization techniques we will cover include:
- Helping the compiler recognize optimization opportunities
- Choosing algorithms that can take advantage of the Cortex-R4's SIMD capabilities
- Using software pipelining and loop unrolling to conceal instruction latencies and reduce stalls
- Reducing memory accesses
We'll start with a simple C implementation of the FIR filter and show a progression of optimization techniques.
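As a point of reference, a straightforward block FIR filter in C might look like the sketch below. It is an illustrative baseline under stated assumptions, not the benchmark implementation: the function name fir_q15, the 16-bit Q15 data format, the 32-bit accumulator, and the absence of saturation are all choices made here for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple block FIR filter (hypothetical baseline, unoptimized).
   x      : input buffer holding n_out + n_taps - 1 samples, oldest first
   h      : n_taps filter coefficients in Q15 format
   y      : output buffer for n_out samples
   Assumptions: products are accumulated in 32 bits without saturation,
   the caller scales the coefficients so the sum cannot overflow, and
   right-shifting a negative value is arithmetic (true on common compilers). */
void fir_q15(const int16_t *x, const int16_t *h, int16_t *y,
             size_t n_out, size_t n_taps)
{
    for (size_t n = 0; n < n_out; n++) {
        int32_t acc = 0;

        /* h[0] multiplies the newest sample in the current window. */
        for (size_t k = 0; k < n_taps; k++) {
            acc += (int32_t)h[k] * x[n + n_taps - 1 - k];
        }

        /* Convert the Q30 accumulator back to Q15 by truncation. */
        y[n] = (int16_t)(acc >> 15);
    }
}
```

In this form, every multiply-accumulate needs two 16-bit loads, one for the sample and one for the coefficient, which is exactly the kind of memory traffic that the techniques listed above aim to reduce.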