Editor's Note: In this paper originally presented at Design East 2012, the author looks at issues and techniques for squeezing maximum energy from batteries in embedded systems in the following parts:
- In Part 1, the author reviews key methods for power reduction and addresses the nature of efficiency in embedded systems
- In Part 2, the author looks at the energy cost of memory access and power-reduction methods for memory access
- This part continues the discussion with an examination of the role of computational efficiency in extending battery life
Elephant Number 2 – Computation Efficiency
And so, having despatched the first elephant of memory accesses, we can now turn our attention to the second elephant in the jungle – that of instruction execution.
On the face of it, managing instruction execution is essentially the same problem as optimizing for raw performance: executing fewer instructions consumes less energy.
There is a wealth of existing literature on this subject, so it does not warrant extensive treatment here. The following are some of the most obvious techniques.
- Configure the tools correctly. The compiler and linker cannot carry out even basic optimisations unless they are fully aware of the target platform: architecture version, core implementation, coprocessors and so on are all important.
- Write code sensibly to avoid unnecessary operations. On the ARM architecture, 32-bit data types are efficient; in general, 8-bit and 16-bit types, while they may occupy less storage, are less efficient to process. The packing and unpacking instructions and the SIMD operations in v6 and v7 of the architecture go some way towards helping with this, but be aware that, in the main, these instructions are not accessible from C as they do not map easily to C data types or operations.
- Algorithm selection is more important still. Whatever operation you are trying to carry out, there will almost certainly be not only multiple possible algorithms but also multiple possible implementations of those algorithms, some more memory-friendly than others. In general, favour algorithms and implementations which favour computation over communication. A simple example would be image rotation: an implementation which copies pixels from a source array to a destination array, carrying out the transformation on the way, will almost certainly access memory more often than one which makes the transformation in place. The difference will be even more marked when the effect of caches is included.
- If extensive data processing is required, the amount of data need not be very large in order to justify the extra instructions involved in copying it to TCM in order to process it. Given the much lower cost of TCM accesses when compared to external memory and even to cache, it does not take much to pay back the overhead.
- Data structures and loops should also be defined in a way which lends itself to vectorization, whether undertaken by the compiler or using a dedicated vector processing engine such as ARM’s NEON architecture.
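As an illustration of the first point about tool configuration, a GCC invocation for a hypothetical Cortex-A9 target might look like the following sketch; the core, FPU and float-ABI flags are assumptions and must match the actual part and toolchain in use.

```shell
# Hypothetical flags for a Cortex-A9 with NEON; substitute your real core.
# -mcpu identifies the core, -mfpu/-mfloat-abi expose the FP/SIMD hardware,
# and without them the compiler cannot use those features at all.
arm-none-eabi-gcc -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=hard -O2 -c filter.c -o filter.o
```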
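To make the data-width point concrete, here is a sketch (not from the original paper) contrasting an 8-bit accumulator, which the compiler must re-truncate to 8 bits after every addition (typically an extra UXTB or AND on ARM), with a 32-bit accumulator that matches the native register width:

```c
#include <stdint.h>

/* Narrow accumulator: the result must be reduced modulo 256 after
   each addition, costing an extra instruction per iteration. */
static uint32_t sum_narrow(const uint8_t *p, uint32_t n)
{
    uint8_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc = (uint8_t)(acc + p[i]);
    return acc;
}

/* Wide accumulator: natural 32-bit arithmetic, no re-truncation. */
static uint32_t sum_wide(const uint8_t *p, uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}
```

Note that the two functions are not even equivalent: summing 300 ones gives 300 with the wide accumulator but 44 (300 mod 256) with the narrow one, so the narrow type costs instructions without necessarily buying correctness.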
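For the image-rotation example above, a 90-degree rotation of a square image can be done in place as a transpose followed by a row reversal; every pixel is read and written within the same array, avoiding a second full-size buffer. This is a sketch for the square case only (a general rotation by an arbitrary angle would need interpolation, which is not shown):

```c
#include <stdint.h>

#define N 4  /* illustrative size; a real image would be much larger */

/* Rotate an N x N image 90 degrees clockwise in place:
   first transpose, then reverse each row. */
static void rotate90_inplace(uint8_t img[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            uint8_t t = img[i][j];
            img[i][j] = img[j][i];
            img[j][i] = t;
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N / 2; j++) {
            uint8_t t = img[i][j];
            img[i][j] = img[i][N - 1 - j];
            img[i][N - 1 - j] = t;
        }
}
```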
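The TCM technique can be sketched as a copy-in, process, copy-out loop. The buffer name and its placement are assumptions for illustration: on a real part the buffer would be located in TCM by the linker script (for example via a section attribute), whereas here it is an ordinary static buffer so the sketch stays portable.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 256

/* Assumed to be placed in TCM on real hardware, e.g. with
   __attribute__((section(".tcm"))) and a matching linker script. */
static uint32_t tcm_buf[BLOCK];

/* Copy a block into fast memory, process it there, copy it back.
   The two memcpy calls are the overhead that the cheap accesses
   inside the processing loop must pay back. */
static void process_via_tcm(uint32_t *ext, uint32_t n)
{
    for (uint32_t done = 0; done < n; done += BLOCK) {
        uint32_t chunk = (n - done < BLOCK) ? n - done : BLOCK;
        memcpy(tcm_buf, ext + done, chunk * sizeof tcm_buf[0]);
        for (uint32_t i = 0; i < chunk; i++)
            tcm_buf[i] *= 2;                  /* stand-in workload */
        memcpy(ext + done, tcm_buf, chunk * sizeof tcm_buf[0]);
    }
}
```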
Loops should be written following some simple rules: use unsigned integer counters, count down, and test for equality with zero as the termination condition. This produces shorter, faster loops which use fewer registers. Following similarly simple rules for control structures and data declarations can also make it much easier for the compiler to unroll and vectorise even the simplest of loops, whether by code transformation or by targeting a vector engine such as NEON.
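The counter rules can be sketched as follows (a hypothetical checksum routine, not from the original paper). Counting down to zero lets the decrement's zero flag serve as the loop test, so no separate compare against `n` is needed and `n` need not stay live in a register across the loop body:

```c
#include <stdint.h>

/* Unsigned counter, counting down, terminating on equality with
   zero, as recommended in the text. */
static uint32_t checksum(const uint32_t *p, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = n; i != 0; i--)
        sum += p[i - 1];
    return sum;
}
```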
This diagram shows some figures relating to one specific loop optimization, loop unrolling.
As we would hope and expect, execution time and instruction count decrease as the unroll factor increases. We are seeing the effect of reduced loop overhead and, to a lesser extent, a reduction in address calculations. The power results are more interesting and not as obvious.
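To show where the instruction-count saving comes from, here is a factor-four unroll of a simple summation loop (a sketch, not the loop measured in the diagram): one counter update and one branch now serve four element operations. For brevity it assumes the element count is a multiple of four; a real version would also handle the remainder.

```c
#include <stdint.h>

/* Unrolled by four: loop overhead (increment, compare, branch) is
   paid once per four additions instead of once per addition.
   Assumes n is a multiple of 4. */
static uint32_t sum_unrolled4(const uint32_t *p, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i += 4) {
        sum += p[i];
        sum += p[i + 1];
        sum += p[i + 2];
        sum += p[i + 3];
    }
    return sum;
}
```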