Design Article
Tutorial: Programming High-Performance DSPs, Part 1
Rob Oshana, Texas Instruments
11/27/2006 11:00 AM EST
[Part 2 of this series explains how to maximize performance with loop unrolling and software pipelining.
Part 3 shows how you can help the compiler produce faster code. It also explains how to optimize for minimum power consumption.]
INTRODUCTION
Many of today's digital signal processing (DSP) applications are subject to real-time constraints, and many eventually grow to the point where they stress the available CPU and memory resources. It can feel like trying to fit ten pounds of algorithms into a five-pound sack. Understanding the architecture of the DSP, as well as the compiler, can speed up applications, sometimes by an order of magnitude. This article summarizes some of the techniques used in practice to gain order-of-magnitude speed increases from high-performance DSPs.
Make the common case fast
The fundamental rule in computer design, as well as in programming real-time systems, is "make the common case fast, and favor the frequent case." This is really just Amdahl's Law, which says that the performance improvement gained from some faster mode of execution is limited by how often that faster mode is used. So don't spend time trying to optimize a piece of code that will hardly ever run. You won't get much out of it, no matter how innovative you are. If, instead, you can eliminate just one cycle from a loop that executes thousands of times, you will see a bigger impact on the bottom line.
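To make the rule concrete, here is a minimal sketch of Amdahl's Law as a C function. The function name `amdahl_speedup` and its parameter names are chosen for illustration and do not come from any library.

```c
#include <assert.h>

/* Overall speedup predicted by Amdahl's Law.
   fraction:      share of the original run time spent in the code
                  being improved (0.0 to 1.0)
   local_speedup: how many times faster that code becomes (> 0) */
double amdahl_speedup(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);

    /* The untouched part still takes (1 - fraction) of the time;
       the improved part shrinks to fraction / local_speedup. */
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

With these definitions, making code that accounts for 1% of the run time ten times faster (`amdahl_speedup(0.01, 10.0)`) gives an overall speedup of only about 1.01, while a modest 1.5x improvement to a loop that accounts for 90% of the run time (`amdahl_speedup(0.9, 1.5)`) gives about 1.43.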
Architecture and memory
Memory can be a severe bottleneck in embedded system architectures. This problem can be reduced by storing the most frequently referenced items in fast on-chip memory and leaving the rest in slower off-chip memory. The problem is that getting data from external memory into on-chip memory takes a lot of time. If the CPU is busy moving data, it cannot be performing other, more important tasks.
Memories come in all flavors (Figure 1). The fastest (and most expensive) memory is generally the registers found on chip. There never seem to be enough of them, and management of this valuable resource is paramount to increasing performance. The next fastest is generally the cache, which holds the instructions or data the processor expects to use in the near future. The slowest memory is generally found off chip and is referred to as external memory. As a real-time programmer, you want to reduce accesses to off-chip external memory, because access to this memory can be slow and cause huge delays in processing. The CPU pipeline must "stall," or wait, until the data has been loaded from this memory. Use of on-chip memory is one of the most effective ways of increasing performance. On-chip memory can be thought of as a sort of data cache, with the main difference being that on-chip memory must be managed explicitly by the programmer, whereas a cache is managed automatically by the hardware.

Figure 1. Memory hierarchy for DSP devices
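The idea of managing on-chip memory by hand can be sketched in C as block-wise processing: copy a block from external memory into a small fast buffer, work on it there, then move on to the next block. This is only an illustration under stated assumptions. `BLOCK_SIZE`, `onchip_buf`, and `energy_blocked` are hypothetical names, the buffer here is an ordinary static array, and `memcpy` stands in for the transfer that a DMA controller would typically perform on a real device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 256  /* hypothetical on-chip buffer size, in samples */

/* Stand-in for a buffer placed in fast on-chip memory. On a real device
   the placement would be controlled by the linker or a compiler-specific
   directive; here it is just a static array. */
static int16_t onchip_buf[BLOCK_SIZE];

/* Sum of squares of n samples that live in (slow) external memory. */
int64_t energy_blocked(const int16_t *external_data, size_t n)
{
    int64_t acc = 0;

    assert(external_data != NULL || n == 0);

    for (size_t start = 0; start < n; start += BLOCK_SIZE) {
        size_t len = n - start;
        if (len > BLOCK_SIZE)
            len = BLOCK_SIZE;

        /* One bulk transfer per block instead of one slow access
           per sample. */
        memcpy(onchip_buf, external_data + start, len * sizeof(int16_t));

        /* All accesses in the inner loop hit the fast buffer. */
        for (size_t i = 0; i < len; i++)
            acc += (int64_t)onchip_buf[i] * onchip_buf[i];
    }
    return acc;
}
```

In this sketch the CPU still does the copying itself. The benefit grows when the transfer of the next block can overlap with processing of the current one, which requires a second buffer and hardware that can move data independently of the CPU.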
Hardware architecture techniques such as pipelining have been used to enhance the performance of processors. The principle of a pipelined processor is not much different from that of an automobile assembly line. Each car moves through the assembly line, being constructed step by step. There are multiple cars in the assembly line at the same time, each car at a different point in the assembly process. At the end of the assembly line emerges a new car, followed closely by another new car, and so on. It was discovered a long time ago that it was more cost effective to start putting a new car together before the previous one was completed. It was a way to keep the available workers doing more work and spending less time idle. In pipelined processors, the same is true. A pipelined processor can start a new task before a previous task is completed. The completion rate becomes a matter of how often a new instruction can be introduced. As shown in Figures 2a and 2b, the completion time of an individual instruction does not change, but the completion rate of instructions improves.
To improve performance even more, multiple pipelines can be used. This approach, called superscalar, further exploits the concept of parallelism (Figure 2c). Some of today's high-performance digital signal processors have a superscalar design.

Figure 2. Non-pipelined, pipelined, and superscalar execution timeline
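The difference between completion time and completion rate can be put into numbers with a small idealized model. The sketch below assumes every pipeline stage takes exactly one cycle and that there are no stalls or dependencies between instructions, which real code rarely achieves. The function names are made up for this illustration.

```c
#include <assert.h>

/* Non-pipelined: each instruction occupies the processor for all of
   its stages before the next one can start. */
unsigned long cycles_nonpipelined(unsigned long n, unsigned long stages)
{
    assert(stages > 0);
    return n * stages;
}

/* Pipelined: the first instruction still needs `stages` cycles, but
   after that one instruction completes every cycle. */
unsigned long cycles_pipelined(unsigned long n, unsigned long stages)
{
    assert(stages > 0);
    return (n == 0) ? 0 : stages + (n - 1);
}

/* Superscalar: `width` identical pipelines, so up to `width`
   instructions enter (and later complete) in the same cycle. */
unsigned long cycles_superscalar(unsigned long n, unsigned long stages,
                                 unsigned long width)
{
    unsigned long groups;

    assert(stages > 0 && width > 0);
    groups = (n + width - 1) / width;   /* ceiling of n / width */
    return (n == 0) ? 0 : stages + (groups - 1);
}
```

Under these assumptions, 100 instructions on a four-stage machine take 400 cycles without pipelining, 103 cycles with a single pipeline, and 53 cycles with two pipelines. Each instruction still takes four cycles from start to finish in every case.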
One way to control multiple execution units and other resources on the processor is to issue multiple instructions at the same time. Some of the latest DSPs, such as the Texas Instruments C6000, are called Very Long Instruction Word (VLIW) machines. Each instruction in a VLIW machine can control multiple execution units on the processor (Figure 3). For example, each VLIW instruction in the TI C6000 DSP is eight instructions long, one instruction for each of the eight potentially available execution units (two each of the L, S, M, and D types) shown in Figure 3. Again, the key is parallelism. In practice, however, it is hard to keep all of these execution units busy all of the time because of various data dependencies. Even so, the potential performance improvement from a VLIW processor is excellent, especially for some DSP applications.

Figure 3. Superscalar architecture for TMS320C6000
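The data dependencies mentioned above can be illustrated with a dot product written two ways. This is a sketch, not a guaranteed optimization: the function names are hypothetical, the inputs are assumed small enough that the 32-bit sums do not overflow, and whether a given compiler actually schedules the independent operations in parallel depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single accumulator: every addition depends on the result of the
   previous one, so the additions form one serial chain. */
int32_t dot_single(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Two accumulators: even and odd products form two independent
   chains, which gives a scheduler the chance to keep two multiply
   units busy in the same cycle. */
int32_t dot_dual(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum0 = 0, sum1 = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        sum0 += (int32_t)a[i] * b[i];
        sum1 += (int32_t)a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        sum0 += (int32_t)a[i] * b[i];

    return sum0 + sum1;
}
```

Both functions return the same value for the same inputs, because integer addition can be regrouped freely as long as nothing overflows. The second form simply exposes parallelism that the first one hides behind a single running sum.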



