Help the compiler
Make intelligent use of “const” as well to indicate to the compiler that certain items are non-volatile. In embedded systems, this is crucial as sections containing “static const” items can be placed in non-volatile memory. Forgetting to declare such items as “const” in your code may confuse both the compiler and future readers and maintainers of your program. Within functions, marking data which you do not intend to modify as “const” assists the compiler in its optimization task as it does not have to work out what your assumptions are – if it can!
Turing machines are wonderful beasts. They may not be particularly capable but they are of infinite extent – principally in that they have infinite storage. We are not so lucky in the real world – in many cases we can treat memory as essentially infinite but we must acknowledge that only a relatively small part of it will be fast.
Turing machines also carry out their limited range of operations at a constant pace: everything takes the same amount of time. In real life, not all operations take the same length of time.
ARM processors, unusually, support a range of instruction sets:
- ARM – The original ARM instruction set in which all instructions are 32-bit.
- Thumb – In earlier cores (ARMv4T and ARMv5) all Thumb instructions were 16-bit, providing improved code density at the expense of some loss in performance. Later processors (ARMv6T2 and ARMv7 onwards) use Thumb-2 technology to add 32-bit instructions, making a complete instruction set that gives an excellent compromise between code density and performance.
- NEON – NEON is a wide SIMD instruction set optionally supported on ARMv7-A processors. It is an excellent target for DSP and multimedia algorithms.
- VFP – ARM’s Vector Floating Point instruction set exists in several incarnations, sometimes integrated with NEON and sometimes on its own.
Make sure you are aware of the instruction sets available on the processor in your system and select which to target for different parts of your application. In modern ARM systems supporting Thumb-2, Thumb is the instruction set of choice for the vast majority of code. ARM is often chosen for hand-crafted assembly code and when compiling high-performance code sections. NEON is chosen for particular algorithms which benefit from its SIMD vector-processing capability.
In systems which do not support Thumb-2, use ARM for performance and Thumb for code density. In such systems, it is common to compile significant parts of a program in different instruction sets and combine them at link time into a single body of code.
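With a GCC-style toolchain, the split is typically made per translation unit using the -mthumb and -marm options; the file names below are hypothetical and the exact flags you need (for example, whether -mthumb-interwork is required) depend on your toolchain and architecture version:

```shell
# Compile the bulk of the program as Thumb for code density...
arm-none-eabi-gcc -c -mthumb -Os control_logic.c -o control_logic.o

# ...and a performance-critical module as ARM.
arm-none-eabi-gcc -c -marm -O2 fast_filter.c -o fast_filter.o

# The linker combines both and inserts interworking veneers where
# ARM code calls Thumb code and vice versa.
arm-none-eabi-gcc control_logic.o fast_filter.o -o program.elf
```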
Those of you who have been using ARM for a while will be familiar with the evolution of the pipelines over the generations. We have moved from a simple three-stage pipeline in the ARM7 to a much more complex variable-length pipeline in the more recent Cortex-A9. The processors in between have had varying lengths and structures. Since all have been in-order execution units, it has historically been crucial to optimize for the pipeline structure when aiming to maximize instruction throughput. In general, the compiler takes care of this, provided that it is configured correctly for the target processor. When coding in assembler, though, it is the job of the programmer manually to order instructions appropriately. This has been one area in which it has been possible for programmers to outdo the compiler.
In modern ARM cores (Cortex-A9 onwards) more advanced execution units make use of techniques such as out-of-order completion and register renaming greatly to reduce the effects of the pipeline on throughput. This makes this kind of optimization much less important. On these processors, optimized C/C++ code is generally a much better choice than assembler.
Working with branch prediction
All ARM processors since ARM10 have made use of branch prediction techniques to improve performance. The precise techniques employed vary from processor to processor and include static, statistical and dynamic prediction, sometimes backed with return stacks, branch target caches and branch target buffers. Generally, branch prediction is one of those things which “just works” – you turn it on and your code runs faster. However, there are some branches which either cannot be predicted or predict poorly.
In the case of successful prediction, the execution time of a branch instruction can be reduced to four cycles (static prediction), one cycle (dynamic prediction) or sometimes even zero cycles (branch folding). The cost of a mis-predict depends on the precise pipeline structure but will be at least seven cycles.
Branches which are not PC-relative are inherently difficult or impossible to predict. Since the target address is unknown until the instruction reaches the execute stage of the pipeline, the processor has no time in which to start fetching ahead from the destination. Some cores (from ARM11 onwards) incorporate a return stack which allows them to predict return instructions as a special case but, in general, processors are unable to predict this kind of branch.
Also, a branch which executes immediately after another branch is not predicted; similarly, of a pair of branches appearing in the same fetch slot in memory, only one will be predicted.