[Editor's note: Part 2 of this series shows how to optimize DSP "kernels," i.e., inner loops. For more programming tips, see the DSP programmer's guide.]
DSP applications typically have tough performance demands. Traditionally this has led DSP programmers to make extensive use of assembly programming. There is no doubt that assembly programming can be very effective, but given today's skill shortages and shrinking product deadlines, there is growing interest in keeping as much DSP software in C as possible. There are a number of reasons to use C rather than assembly language:
- C is much cheaper to develop.
- C is much cheaper to maintain.
- C is comparatively portable.
C enables you to bring in a portable program and quickly experiment with it in order to gauge its performance potential. Management loves this because they see working, if slow, product elements early in the development cycle. However, C has its own problems, rooted in the semantic gaps between the programming language, the design, and the hardware:
- ANSI C was not designed as a signal processing language; its emphasis is on system design rather than mathematics. Thus, ANSI C is not the most natural way to express DSP algorithms.
- DSP processors have many elements that reduce their suitability for compilation, such as specialized addressing modes. Thus, DSP processor designs often assume use of assembly language in performance-critical code.
- DSP workloads presented to the compiler are moving rapidly into new areas, such as high-definition video and cryptography. In contrast, high-level languages evolve at a glacial pace. This leaves a growing gap that is difficult for the compiler to bridge. For example, it is hard for the compiler to recognize sub-word SIMD or Galois-field operations because they are clumsily expressed in standard C, as the sketch after this list shows.
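To see the gap concretely, consider a four-way, 8-bit SIMD addition. A SIMD-capable DSP performs it in one instruction, but portable C must unpack it lane by lane. The sketch below is illustrative; the function name and lane layout are our own, not part of any standard:

/* Add four packed 8-bit lanes held in a 32-bit word. A SIMD
   machine does this in one instruction; portable C must mask,
   shift, add, and repack each lane by hand. */
unsigned long add4x8(unsigned long a, unsigned long b)
{
    unsigned long sum = 0;
    int lane;
    for (lane = 0; lane < 4; lane++) {
        unsigned long x = (a >> (8 * lane)) & 0xFF;
        unsigned long y = (b >> (8 * lane)) & 0xFF;
        sum |= ((x + y) & 0xFF) << (8 * lane);
    }
    return sum;
}

A compiler would have to prove that this whole mask-and-shift pattern amounts to a single packed add before it could use the SIMD unit, and few compilers manage that.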
For all of these reasons and more, the performance of compiled code may be far less than that of hand-written assembly, at least if you take a simplistic approach to writing your C code. In this series we will look at ways to tune the performance of your C programs so you can avoid the assembly option.
First, some basics.
Understanding the Application and the Processor
The benefits of tweaking C are irrelevant unless you choose an efficient algorithm in the first place. Thus, it is important to start by making sure you understand the different algorithmic approaches available for your application, and the tradeoffs associated with each approach. Of course, working in C can make it more feasible to experiment with different algorithms.
You must also gain a low-level awareness of the DSP processor's capabilities. The machine's characteristics will ultimately determine the level of performance you can achieve, so you need to understand these capabilities in order to set targets for the performance of your code. It is particularly important to understand your processor's specialized features. For example, the processor may have highly efficient operations for things like Viterbi decoding, bit multiplexing, or vectorized multiply-accumulates (MACs). You also need to consider the processor's memory system. For example, will the bus capacity support the amount of data you hope to process?
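To make the bandwidth question concrete with a deliberately simple, hypothetical example: a 16-bit stereo audio stream at 48 kHz needs only about 192 KB/s, while a single pass over 16-bit VGA video at 30 frames/s already moves roughly 18 MB/s (640 x 480 pixels x 2 bytes x 30 frames), and a multi-pass algorithm multiplies that figure accordingly.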
Once you understand the processor, evaluate whether your algorithm maps well onto the processor's low-level facilities. Then look at the assembly code emitted by the compiler to decide whether you are actually using those facilities efficiently.
The C language offers a uniform computational model, which means a C programmer can assume his program will give the same answer on any platform. However, the C computational model can be supported in different ways on different platforms. For instance, if a machine has native floating-point instructions, C's default double-precision arithmetic is supported efficiently. But on a machine without native floating point, the compiler will quietly plant calls to an emulation library, and the speed of the code can drop by a factor of a hundred.
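When the target has no FPU, the usual escape is to recast the arithmetic in fixed point. The sketch below contrasts the two approaches for a simple gain stage; the Q15 format (a 16-bit fraction in [-1, 1)) is a common DSP convention, not something ANSI C defines:

/* Floating-point version: on a machine without an FPU, every
   multiply here becomes a call into the emulation library. */
void scale_float(float *x, float gain, int n)
{
    int i;
    for (i = 0; i < n; i++)
        x[i] *= gain;
}

/* Q15 fixed-point version: the product of two Q15 values is Q30,
   so a shift by 15 renormalizes it. This maps directly onto the
   DSP's integer multiplier. */
void scale_q15(short *x, short gain, int n)
{
    int i;
    for (i = 0; i < n; i++)
        x[i] = (short)(((long)x[i] * gain) >> 15);
}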
C also assumes a large flat memory model. In reality, memory access costs can be highly irregular and can dominate application performance. Thus, blindly following C conventions can ruin performance.
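As a simple illustration (the array dimensions are arbitrary), the two loops below compute the same sum. The first walks memory sequentially; the second strides across it. C says nothing about the difference, but on a real memory system the strided version can be several times slower:

#define ROWS 256
#define COLS 256

/* Sequential access: consecutive iterations touch adjacent words. */
long sum_rows(short a[ROWS][COLS])
{
    long s = 0;
    int i, j;
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Strided access: each iteration jumps COLS words ahead. */
long sum_cols(short a[ROWS][COLS])
{
    long s = 0;
    int i, j;
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}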
Perhaps more importantly, "portable" C is surprisingly machine-dependent, even in such basics as what an "int" is. Depending on the machine, an int may be either 16 or 32 bits wide. Obviously, making the wrong assumption can greatly affect the performance of your code, and it can even lead to incorrect operation.
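A defensive habit is to state the widths you depend on explicitly. C99's <stdint.h> types (or an equivalent project-specific header on older compilers) make the assumption visible:

#include <stdint.h>

int16_t sample;        /* exactly 16 bits on every platform */
int32_t accumulator;   /* exactly 32 bits on every platform */

/* A plain "int" here would be 16 bits on some DSPs and 32 bits
   on others, silently changing overflow and wraparound behavior. */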
There is often a poor match between C and the features of DSPs in the areas of accumulators, vectorization (that is, SIMD hardware), and fractional processing. These hardware features are essential to efficient processing, but they are not natively supported in ANSI C. So the message for the C programmer is that C programs can be ported with little difficulty, but if you want high efficiency, you can't ignore the underlying hardware.
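For example, a saturating fractional MAC, one instruction on most DSPs, has to be spelled out step by step in ANSI C. A good DSP compiler may recognize the idiom and map it onto the hardware accumulator, but nothing guarantees it. This sketch assumes a 32-bit long and relies on wraparound overflow behavior that most DSP compilers provide but ISO C does not promise:

/* Q15 multiply-accumulate with saturation, written out by hand.
   On most DSPs this whole function is a single MAC instruction. */
long mac_q15(long acc, short x, short y)
{
    /* Q15 * Q15 gives a Q30 product; the shift renormalizes to Q31. */
    long p = ((long)x * y) << 1;
    long sum = acc + p;
    /* Saturate on overflow instead of wrapping around. */
    if (acc > 0 && p > 0 && sum < 0)
        sum = 0x7FFFFFFFL;              /* clamp to most positive */
    else if (acc < 0 && p < 0 && sum >= 0)
        sum = (long)-0x7FFFFFFFL - 1L;  /* clamp to most negative */
    /* (A production version must also guard the -1.0 * -1.0
       corner case in the product itself.) */
    return sum;
}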
Bear in mind that there is a conflict in program design between generality and explicitness. For instance, consider how you access data. If your goal is to maximize generality, you might use highly indirect pointers. The problem with this approach is that it forces the compiler to use a conservative strategy. For example, suppose you write a memcpy call as follows:
memcpy( state->Ptr1, &ShortArray[*PtrIndex], num );
In this scenario, the compiler cannot deduce the data addresses. In order to avoid overlapping or misaligned data, it will generate very slow but safe code, perhaps transferring only one data word at a time:
Cycle 1: Load 16 bits
Cycle 2: Store 16 bits
You can make the identity and alignment of the data more obvious by writing the call as:
memcpy( IntArray1, IntArray2, num );
Now the compiler can deploy wide loads and vectorization. On a typical DSP, overlapping the 32-bit loads and stores like this provides a four-fold improvement in throughput:
Cycle 1: Load 32 bits
Cycle 2: Load 32 bits || Store 32 bits
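If you write the copy loop yourself rather than calling memcpy, C99's restrict qualifier lets you hand the compiler the same no-overlap guarantee explicitly. A minimal sketch (older compilers may spell the keyword __restrict):

/* restrict promises the compiler that dst and src never overlap,
   freeing it to use wide, software-pipelined loads and stores. */
void copy_words(int * restrict dst, const int * restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}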
Occasionally you can speed up a program simply by making the C code more elegant, as shown in the memcpy example. More often than not, however, the speedup comes from specializing the program for the hardware. This process leaves you with a faster program, but it also gives you a program that is larger, more complex, and less portable. In other words, there is a price to pay for performance. In order to minimize the price, you should target your optimization work to key areas only, and resist the temptation to write everything for maximum efficiency.