Many mobile phones, particularly in the United States, are given away by wireless service providers to attract customers. Because the providers give the phones away, they want the cheapest implementation consistent with acceptable performance. After cost, the most important features are standby time, talk time, and the phone's feature set, in that order. The phones with the longest battery life and the smallest overall size command the highest prices.
New standards, such as General Packet Radio Service (GPRS) and High Speed Circuit Switched Data (HSCSD), which can handle large amounts of data, are beginning to be implemented as standard features in mobile phones. These new standards use multiple time slots, processed in parallel, to increase the data rate, radically increasing the amount of throughput that the DSP processor resources in the phone must handle. New audio features, such as MP3 and speech processing, are also increasing DSP processing requirements. The question is, what is the best DSP solution to meet increasing processing requirements without sending power consumption or end-product size through the roof?
Traditionally, performance increases have been achieved by either increasing the clock frequency or selecting a highly parallel processor with a Very Long Instruction Word (VLIW) architecture. Increasing the clock frequency typically requires a higher supply voltage, and because dynamic power scales with the clock frequency and with the square of the supply voltage, power consumption rises much faster than performance. VLIW processors enable increased performance at lower clock speeds by enabling a high degree of parallelism, but require substantially more program memory. Larger memory results in higher cost and greater power consumption, as well as a larger design footprint. This trade-off between power and performance suggests that the traditional approach of using off-the-shelf DSP processors, such as those from Texas Instruments, may not be suitable when performance, cost, and power consumption are equally important.
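The power penalty of a faster clock can be illustrated with the standard first-order model for dynamic CMOS switching power, P = a * C * V^2 * f. The sketch below is only an illustration: the capacitance, voltages, and clock rates are hypothetical values chosen for the arithmetic, not figures for any particular DSP.

```python
def dynamic_power(c_eff, v_dd, f_clk, activity=1.0):
    """First-order dynamic switching power in watts: P = a * C * V^2 * f."""
    return activity * c_eff * v_dd ** 2 * f_clk


def scaled_power_ratio(freq_scale, volt_scale):
    """Ratio of new to old dynamic power after scaling clock and supply."""
    return freq_scale * volt_scale ** 2


# Hypothetical baseline: 1 nF effective switched capacitance, 1.2 V, 100 MHz.
baseline = dynamic_power(1e-9, 1.2, 100e6)

# Hypothetical faster design: clock doubled, supply raised by 25% to 1.5 V.
faster = dynamic_power(1e-9, 1.5, 200e6)

if __name__ == "__main__":
    print(f"baseline: {baseline:.3f} W, faster: {faster:.3f} W")
    print(f"power ratio: {faster / baseline:.3f}")
```

Under these assumed numbers, doubling the clock while raising the supply by 25% roughly triples the dynamic power for a twofold gain in raw clock rate.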
Another approach is to use a new breed of highly parallel DSP cores that are reconfigurable and scalable, and targeted toward the application. These devices have extendable architectures, allowing designers to take maximum advantage of the dataflow characteristics (memory access) and parallelism (concurrency) that exist in the application algorithms (Figure 1).
Figure 1: Example of a reconfigurable, extendable DSP core architecture
In the last few years, Adelante Technologies, 3DSP, BOPS, and other vendors have introduced these types of flexible cores that offer a high degree of parallelism and allow the user to scale the processing power to increase processing throughput with a minimal silicon or power consumption penalty.
Devices from these vendors all support scalability. BOPS allows the addition of its own brand of processing elements (PEs) to the basic core. 3DSP has a library of multipliers, ALUs, and other resources to extend the basic DSP architecture. Finally, Adelante Technologies offers a three-stage scalability approach:
- Extending the 16-bit compact instruction set with VLIW Application Specific Instructions (ASIs)
- Extending the core inner datapath with Application Specific Execution Units
- Extending the DSP Subsystem with Application Specific Co-Processors
As an example, Adelante's Saturn core has a highly parallel 96-bit VLIW architecture that is accessed using encoded 16-bit instruction words. The user gets the processing advantage of a VLIW architecture with the compact code of a 16-bit machine. Based on an analysis of over a million lines of wireless application code, the instruction set is optimized for wireless applications.
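The code-density argument can be made concrete with a little arithmetic. The sketch below compares program memory for the same instruction count at 16 and 96 bits per instruction, and shows a toy decode table that expands a compact opcode into one operation per execution slot. The opcodes, slot names, and three-slot layout are hypothetical and are not Adelante's actual encoding.

```python
COMPACT_BITS = 16  # encoded instruction width quoted in the text
VLIW_BITS = 96     # internal VLIW width quoted in the text


def program_bytes(num_instructions, bits_per_instruction):
    """Program memory needed for a given instruction count and width."""
    return num_instructions * bits_per_instruction // 8


# Hypothetical decode table: compact opcode -> operation for each slot.
# A real core would have more slots and a hardware decoder.
DECODE_TABLE = {
    0x0001: ("mul", "nop", "nop"),
    0x0002: ("nop", "add", "load"),
}


def decode(opcode):
    """Expand a compact opcode into its wide, per-slot operations."""
    return DECODE_TABLE[opcode]


if __name__ == "__main__":
    n = 1000
    print(program_bytes(n, COMPACT_BITS), "bytes compact")
    print(program_bytes(n, VLIW_BITS), "bytes as raw VLIW words")
    print(decode(0x0002))
```

For the same instruction count, storing raw 96-bit words would take six times the program memory of the 16-bit encoding, which is the saving the compact instruction set is meant to capture.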
The Saturn core allows users to increase throughput by creating 96-bit Application Specific Instructions that fully exploit the architecture's resources and can execute as many as 12 instructions in a single clock cycle. For example, a Viterbi butterfly (Figure 2) that takes 11 clock cycles to complete in regular assembly code consumes only two clock cycles using two application-specific instructions. This is a 5.5x speedup with no additional silicon.
Figure 2: Viterbi butterfly, an example of a nested loop calculation
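For readers unfamiliar with the operation, a Viterbi butterfly is an add-compare-select (ACS) step that updates two successor states from two predecessor states. The following is a minimal sketch of the textbook form, not the Saturn implementation. It assumes correlation-style metrics (larger is better), symmetric branch metrics of +bm and -bm, and the common trellis layout in which states 2j and 2j+1 feed states j and j + N/2.

```python
def viterbi_butterfly(pm_a, pm_b, bm):
    """One add-compare-select butterfly.

    pm_a, pm_b: path metrics of the two predecessor states.
    bm: branch metric (assumed symmetric: +bm on one branch, -bm on the other).
    Returns ((metric_0, decision_0), (metric_1, decision_1)), where a
    decision of 0 means predecessor a survived and 1 means predecessor b.
    """
    # Add: four candidate path metrics.
    cand_0a, cand_0b = pm_a + bm, pm_b - bm
    cand_1a, cand_1b = pm_a - bm, pm_b + bm
    # Compare and select the larger candidate for each successor state.
    dec_0 = 1 if cand_0b > cand_0a else 0
    dec_1 = 1 if cand_1b > cand_1a else 0
    new_0 = cand_0b if dec_0 else cand_0a
    new_1 = cand_1b if dec_1 else cand_1a
    return (new_0, dec_0), (new_1, dec_1)


def acs_stage(path_metrics, branch_metrics):
    """Inner loop: run one butterfly per state pair for a received symbol."""
    n = len(path_metrics)
    half = n // 2
    new_metrics = [0] * n
    decisions = [0] * n
    for j in range(half):
        (m0, d0), (m1, d1) = viterbi_butterfly(
            path_metrics[2 * j], path_metrics[2 * j + 1], branch_metrics[j]
        )
        new_metrics[j], decisions[j] = m0, d0
        new_metrics[j + half], decisions[j + half] = m1, d1
    return new_metrics, decisions


if __name__ == "__main__":
    print(acs_stage([10, 4, 0, 5], [3, 2]))
```

A decoder calls `acs_stage` once per received symbol, so the butterfly sits at the center of a nested loop, which is why shaving cycles from it pays off so directly.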
BOPS has 'indirect' VLIW instructions that can also access all the processing elements inside the core via a 32-bit instruction path. However, executing VLIW instructions on a 32-bit data path could require multiple clock cycles per instruction. The Adelante core can execute a full 96-bit VLIW instruction in a single clock cycle.
Another means of increasing performance is to add processing elements to the core architecture. All reconfigurable cores allow the architecture to be extended. The BOPS and 3DSP cores let the user add standard hardware elements, while the Saturn core lets the designer create application-specific execution units that are designed to perform specific processing functions. All three vendors' hardware extensions run within the inner environment of the DSP core, sharing the existing bus, memory, and other resources. These hardware extensions can even execute in parallel with other operating resources.
Extending the DSP core's resources provides even greater performance gains with minimal additional silicon. For example, a Saturn Application Specific Execution Unit built to execute the Viterbi butterfly described above yields a four-fold improvement over the Application Specific Instruction implementation, and a 22x improvement over the original assembly code, while taking only 1,000 additional gates.
A third means of extending DSP performance is to add tightly connected Application Specific Co-Processors with their own architecture, control, and calculation resources. Most vendors of reconfigurable DSP cores offer such co-processors. The drawback of a co-processor is that it will increase the silicon area by roughly 5,000 to 40,000 gates. The advantage is that a co-processor architecture that has been designed from the ground up for a specific application will yield the maximum efficiency. Adelante's Viterbi co-processor adds 15,000 gates to the 40,000-gate core, but can execute 60 Viterbi butterfly calculations each clock cycle, 660 times faster than the original assembly implementation (Table 1).
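| Application Specific Acceleration | Cycles per Viterbi butterfly | Speedup over assembly | Additional gates |
|---|---|---|---|
| None (regular assembly code) | 11 | 1x | 0 |
| Application Specific Instructions | 2 | 5.5x | 0 |
| Application Specific Execution Unit | 0.5 | 22x | 1,000 |
| Application Specific Co-Processor | 1/60 | 660x | 15,000 |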
Table 1: Viterbi butterfly execution and silicon cost using different acceleration techniques
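The speedup figures quoted in the text can be cross-checked from the cycle counts alone. The sketch below uses only numbers stated in the article (11 cycles in assembly, 2 cycles with Application Specific Instructions, a further four-fold gain from the execution unit, 60 butterflies per cycle for the co-processor, and a 40,000-gate core). The function and variable names are illustrative.

```python
from fractions import Fraction

BASELINE_CYCLES = Fraction(11)  # cycles per butterfly in regular assembly
CORE_GATES = 40_000             # gate count of the core quoted in the text

# (technique, cycles per butterfly, additional gates)
TECHNIQUES = [
    ("assembly", Fraction(11), 0),
    ("application-specific instructions", Fraction(2), 0),
    ("application-specific execution unit", Fraction(2) / 4, 1_000),
    ("application-specific co-processor", Fraction(1, 60), 15_000),
]


def speedup(cycles_per_butterfly):
    """Speedup relative to the regular assembly implementation."""
    return BASELINE_CYCLES / cycles_per_butterfly


def summarize():
    """Return (technique, speedup, added gates, total gates) for each row."""
    return [
        (name, float(speedup(cycles)), extra, CORE_GATES + extra)
        for name, cycles, extra in TECHNIQUES
    ]


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Exact fractions keep the 1/60-cycle case free of rounding error. The co-processor's 660x gain comes at a cost of 15,000 gates on top of the 40,000-gate core, an increase of 37.5% in gate count.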
Integration of the DSP core and any co-processors into the DSP sub-system and into the final system-on-a-chip (SoC) is also a significant issue in the design of high-performance wireless devices. Virtually all mobile phones are multi-processor systems with a MIPS or ARM CPU alongside the DSP. DSP cores from Texas Instruments, DSP Group, and some other vendors can provide excellent performance and power qualities, but may not provide any of the sub-system elements required to integrate the baseband processing with external I/O, memory, and processors. Integrating and verifying the DSP in the sub-system is then left largely to the designer, a task that can take six months to a year.
This new breed of configurable, extendable DSP cores generally has much better support for integrating the core into the sub-system (Figure 3). Adelante's Saturn core, for example, comes with the Lunar sub-system, a complete, configurable DSP sub-system that includes configurable program and data memory, buses, DMA, BIST, JTAG debugging, and extensive AMBA interfaces to external processors, peripherals, and memory. The Lunar sub-system is completely integrated with the Saturn core and fully verified prior to delivery to the customer. It is essentially a "drop-in" DSP sub-system. Other vendors provide similar support, which can be critical in getting the design done.
Figure 3: DSP sub-system
Code development and debug options, plus design verification, are also very important considerations in wireless designs, particularly when creating DSP sub-systems for inclusion in SoCs. The JTAG interface on Adelante's Lunar DSP sub-system, for example, includes a run-time debug block that supports in-circuit, run-time emulation in conjunction with the core's development environment. Debugging options should include the ability to set hardware breakpoints and perform single-step code execution. The Saturn core's JTAG debugger gives the designer complete visibility into the core's registers, memories, and sub-system scan chains. These hardware-debugging features are also accessible in the final SoC. 3DSP and BOPS also offer on-chip JTAG debugging capability.
In conclusion, designers who must continually add features while squeezing power consumption out of their DSP designs should consider this new breed of reconfigurable, extendable DSP core. Such cores offer the flexibility to achieve much better performance with minimal impact on power consumption or silicon. These are particularly desirable attributes in wireless handsets where battery life and cost are the primary drivers of end-product success.
About the Author
Kees Moerman is Adelante Technologies' chief architect. He is responsible for the ongoing development of the company's DSP cores and developed the company's Saturn DSP core and Lunar DSP subsystem. Prior to joining Adelante Technologies, Dr. Moerman was Innovation Manager in the Embedded Processor Department of Philips Semiconductors. He studied physics and computer sciences at Utrecht University in the Netherlands (1959). Adelante Technologies is located in Leuven, Belgium.