Design Article

Optimizing DSP Performance and Minimizing Risk

Jack Shandle

6/17/2004 12:00 AM EDT


Design engineers who must balance heavy-duty signal-processing performance with low power consumption and/or aggressive price points for their chips increasingly find themselves in a quandary. Their choices include, but are not limited to:

  • Off-the-shelf DSPs that leave a lot of potential power savings on the table because the same programmability that makes them flexible and easy to use also consumes more cycles to execute an algorithm than a hardwired solution.

  • Configurable processors that typically require a significant investment of time and money but come much closer to an optimized power/performance solution. These solutions fall into the ASIC category, where high-volume sales of the chip are needed to justify the investment.

  • Programmable logic solutions that can deliver signal-processing performance and efficiency by hardwiring algorithms but generally exact a price in power consumption because of routing and gate-utilization overhead.

If an application profile reveals that 20% of execution time is spent on control functions and 80% on signal processing, the case for configurability (that is, mapping algorithms to gates) can be compelling. This is also one of those rare instances where designers get the best of both worlds, because the custom solution uses fewer gates and therefore delivers more performance for less power.
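Amdahl's law, which the article does not name, is one way to put numbers on the 20%/80% profile. The sketch below assumes the control portion runs unchanged and only the signal-processing portion is accelerated; the acceleration factors are illustrative, not measurements.

```python
def amdahl_speedup(accel_fraction: float, accel_factor: float) -> float:
    """Overall speedup when only `accel_fraction` of the run time is sped up
    by `accel_factor`; the rest (here, control code) runs unchanged."""
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be in [0, 1]")
    if accel_factor <= 0.0:
        raise ValueError("accel_factor must be positive")
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)


if __name__ == "__main__":
    # 80% signal processing, 20% control, per the profile described above.
    # The acceleration factors are illustrative, not measured values.
    for factor in (2, 10, 50):
        print(f"{factor:>3}x on signal processing -> "
              f"{amdahl_speedup(0.8, factor):.2f}x overall")
```

Under these assumptions, speeding up the signal-processing portion 2X, 10X and 50X gives about 1.67X, 3.57X and 4.63X overall, and no acceleration of that 80% can push the total past 1/0.2 = 5X. The model covers execution time only; it says nothing about the power savings that come from using fewer gates.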

One way to picture the designer's quandary is to divide processing into a control plane, typically the responsibility of microprocessors and microcontrollers, and a data plane, typically the province of DSPs.


Figure 1:  Architecture Options and Trends in System Design

The diagram in Figure 1 shows ARM Ltd.'s perspective on the problem. Its cores occupy the control/programmability quadrant (lower left). Adding signal-processing extensions to the cores gives them the capability to handle low-bandwidth requirements for moderately complex algorithms such as MP3 audio processing. Adding single-instruction, multiple-data (SIMD) instructions to the architecture handles applications such as video processing. ARM has also introduced application-specific accelerators for tasks such as 3D graphics rendering. Until now, however, ARM's strategy has not been able to address data-plane applications in the upper right quadrant, which is why it has introduced OptimoDE technology.

Slow March Toward Configurability
But migration to configurable signal processing has been slow, largely because design teams must consider more than just the power/performance tradeoffs and the learning curve for any new technology.

Software development tools, for example, are an issue because they have to be:

  1. Robust.
  2. Flexible enough to keep pace with architectural changes and user requirements.

This is a tall order for the startup companies typically associated with technology breakthroughs. Another software-related issue is compatibility with legacy application code. Legacy C code can be recompiled, of course, but the new architecture's compiler has to be very efficient to preserve the performance gains that justified migrating to the new architecture in the first place.

Still another problem is the viability of the new architecture in the marketplace. In other words, design teams must consider the vendor's long-term survival prospects.

Partitioning the Problem
A middle course between an off-the-shelf DSP and a "100% configurable" solution is an approach that provides pre-configured, reasonably complete signal-processing building blocks, hardware configuration tools and a compiler to ensure that software and hardware are developed in sync. The tools allow a degree of customization as well as assembly of the blocks the design team chooses to implement the algorithms inherent to the application.

This was the course Adelante Technologies' AR|T DSP coprocessor division was pursuing when ARM acquired it in July 2003. ARM has since extended the technology, renamed it OptimoDE and, in May 2004, released it as a product.

OptimoDE is a framework that includes a configurable VLIW-style architecture from which specific data engines are derived, a comprehensive set of hardware configuration tools, a C compiler, reference examples and an AMBA interface kit.

The data engine consists of datapath units that are pieced together in drag-and-drop fashion to execute the target algorithm. These functional datapath units include generic arithmetic and logical units, storage, and interconnect. Other specialized functions include butterfly and DCT engines. Users can extend and customize datapath units to meet application-specific processing needs.
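A butterfly unit computes the basic step of a radix-2 FFT: one complex multiply by a twiddle factor, followed by an add and a subtract. The Python sketch below shows only that arithmetic, in floating point, and builds a small recursive FFT from it so the result can be checked against a direct DFT. A hardware butterfly engine would typically work in fixed point and be pipelined; nothing here reflects how OptimoDE's butterfly unit is actually built, and the function names are illustrative.

```python
import cmath
from typing import List, Sequence, Tuple


def butterfly(a: complex, b: complex, w: complex) -> Tuple[complex, complex]:
    """Radix-2 decimation-in-time butterfly: one complex multiply by the
    twiddle factor w, then one add and one subtract."""
    t = w * b
    return a + t, a - t


def fft_radix2(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 FFT built entirely from butterflies.
    The input length must be a nonzero power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    half = n // 2
    for k in range(half):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        out[k], out[k + half] = butterfly(even[k], odd[k], w)
    return out


def dft(x: Sequence[complex]) -> List[complex]:
    """Direct O(n^2) DFT, used only to check the butterfly-based result."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, 0, 0, 0]
    for fast, slow in zip(fft_radix2(samples), dft(samples)):
        print(f"{fast:.3f}   {slow:.3f}")
```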

The OptimoDE C compiler is optimized for the architecture, and supplying it alongside the hardware tools ensures that hardware and software are optimized together.

OptimoDE Design Flow
The design team starts with a C/C++ specification of the algorithm, which is generally the most straightforward way to specify one. Next, a data-engine template is chosen: for example, a single ALU, a single multiplier and a single RAM block with a global bus and merged memory and address registers.

Using the template and the ARM configuration tools, the design team configures the architecture to meet the specified performance requirements, adding other blocks as needed. OptimoDE software can extract parallelism from the algorithm, which helps the design team significantly. Low-power optimization techniques are also built into the tools. The process is iterative: the architecture's parameters are tweaked and the result is profiled against the performance goals, which typically consist of a cycle budget and a power budget.
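The iterative loop described above (configure, profile against a cycle budget, adjust) can be sketched as a brute-force design-space search. Everything in this sketch is hypothetical: EngineConfig, KernelProfile, the operation counts and the cost model are illustrative assumptions, not OptimoDE's tools or algorithms. Cycles are estimated from the busiest unit type alone, and the total unit count stands in for power and area.

```python
import itertools
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EngineConfig:
    """Hypothetical data-engine configuration: how many of each unit type."""
    alus: int
    multipliers: int
    ram_ports: int

    def unit_count(self) -> int:
        # Crude stand-in for area and power: more units, more gates.
        return self.alus + self.multipliers + self.ram_ports


@dataclass(frozen=True)
class KernelProfile:
    """Operation counts for one pass of the algorithm (illustrative numbers)."""
    alu_ops: int
    mul_ops: int
    mem_ops: int


def estimate_cycles(cfg: EngineConfig, prof: KernelProfile) -> int:
    """Resource-bound estimate: the most heavily loaded unit type sets the
    cycle count (ignores data dependencies and scheduling overhead)."""
    return max(
        math.ceil(prof.alu_ops / cfg.alus),
        math.ceil(prof.mul_ops / cfg.multipliers),
        math.ceil(prof.mem_ops / cfg.ram_ports),
    )


def explore(prof: KernelProfile, cycle_budget: int,
            max_units: int = 4) -> Optional[Tuple[EngineConfig, int]]:
    """Try every configuration up to max_units of each type and return the
    smallest one (fewest units, then fewest cycles) that meets the budget."""
    best = None
    for alus, muls, ports in itertools.product(range(1, max_units + 1), repeat=3):
        cfg = EngineConfig(alus, muls, ports)
        cycles = estimate_cycles(cfg, prof)
        if cycles > cycle_budget:
            continue  # misses the cycle budget: reject this configuration
        key = (cfg.unit_count(), cycles)
        if best is None or key < best[0]:
            best = (key, cfg, cycles)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    profile = KernelProfile(alu_ops=1200, mul_ops=800, mem_ops=1000)
    print(explore(profile, cycle_budget=500))
```

With the example profile, the search returns three ALUs, two multipliers and two RAM ports at 500 cycles. A real flow would schedule the actual data-flow graph, model power directly and let the designer add specialized units such as butterfly or DCT engines; the exhaustive loop is practical here only because this toy space has 64 points.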

The software is generated automatically and can be regenerated to handle design changes, and even alternative algorithms, without changing the hardware. Once the data engine is incorporated into a system-on-chip (SoC) design, OptimoDE also automatically generates models for Verilog simulation.

The performance improvements can be impressive. An H.263 video codec that must handle full-duplex operation at 30 frames per second (CIF, 352x288) within a 384-kbps bandwidth, for example, uses no fewer than 10 algorithms for encoding, decoding and color conversion. Table 1 compares the results of using OptimoDE with executing the algorithms in software on an ARM9E. The speed-up for the individual algorithms ranged from 6X to 50X. More important, the OptimoDE implementation ran at less than 100 MHz, while the ARM9E option would theoretically have required a clock of more than 800 MHz to execute the same algorithm suite.

Estimated performance requirements: H.263 CIF (352x288), 30 fps, 384 kbps

Algorithm                 ARM9E (MHz)   OptimoDE (MHz)   Speed-up
Decode
  Deblock                       154.3             15.2        10X
  Dering                        142.6             11.4        13X
  IDCT                           29.3              4.1         7X
  Motion Compensation            32.7              4.3         6X
  Decode total                  358.9             36.4        10X
Encode
  FDCT                           39.7              4.3         9X
  SAD                           308.4              6.2        50X
  IDCT                           29.3              4.1         7X
  Motion Compensation            11.4              2.0         6X
  Encode total                  388.9             16.6        23X
Color Space Conversion
  YUV to RGB                     82.7              4.8        17X
  RGB to YUV                     48.8              5.7         9X

Table 1:  Data engine performance compared to programmable implementation
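The headline comparison in the text (less than 100 MHz versus more than 800 MHz) can be checked by summing the per-kernel rows of Table 1. The sketch assumes, as the text implies, that each implementation's clock requirements simply add across the kernels it runs; the dictionary layout and function name are illustrative.

```python
# Per-kernel clock requirements (MHz) from Table 1: (ARM9E, OptimoDE).
TABLE_1 = {
    "Decode": {
        "Deblock": (154.3, 15.2),
        "Dering": (142.6, 11.4),
        "IDCT": (29.3, 4.1),
        "Motion Compensation": (32.7, 4.3),
    },
    "Encode": {
        "FDCT": (39.7, 4.3),
        "SAD": (308.4, 6.2),
        "IDCT": (29.3, 4.1),
        "Motion Compensation": (11.4, 2.0),
    },
    "Color Space Conversion": {
        "YUV to RGB": (82.7, 4.8),
        "RGB to YUV": (48.8, 5.7),
    },
}


def total_mhz(table):
    """Sum the per-kernel rows, assuming the kernels' clock needs add up."""
    arm = sum(a for group in table.values() for a, _ in group.values())
    opt = sum(o for group in table.values() for _, o in group.values())
    return arm, opt


if __name__ == "__main__":
    arm, opt = total_mhz(TABLE_1)
    print(f"ARM9E: {arm:.1f} MHz, OptimoDE: {opt:.1f} MHz, "
          f"ratio {arm / opt:.1f}x")
```

The rows sum to roughly 879 MHz for the ARM9E and roughly 62 MHz for OptimoDE, consistent with the figures in the text. Summing the individual decode rows gives 35.0 MHz for OptimoDE rather than the printed 36.4 MHz subtotal, which does not change the conclusion.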

OptimoDE also addresses most of the concerns noted above about the slow adoption of configurable DSP solutions. The development tools, for example, are part of the OptimoDE framework and are upgraded by ARM, which has a good track record of offering efficient tools. Because OptimoDE generates the accompanying application software, legacy software is not a problem, and ARM's viability is not in question in most quarters.

Case History #1—Low Power Optimization
Headquartered in Staefa, Switzerland, Phonak AG is one of the world's leading designers and manufacturers of hearing instruments. Signal processing is a critical aspect of the silicon technology Phonak uses in its systems to execute standard and proprietary algorithms, among other tasks.

Hearing instruments have all the defining attributes of "mobile" devices, such as constraints on size, weight, and battery life, but they take them to new levels of minimalism. These constraints mean that off-the-shelf, programmable DSPs are simply not a good fit.

Power consumption, in particular, must be ultra-low. Phonak's internal technology evaluation at the start of the project concluded that commercial DSP chips were up to an order of magnitude too power-hungry for hearing-instrument applications, says Hans-Ueli Roeck, manager of hearing instrument software development at Phonak.

Phonak had been using the technology developed by the AR|T DSP coprocessor division of Adelante Technologies for several years before the technology was acquired by ARM. Even in the early days, the technology had proven itself at Phonak. "It was crucial for the success of the products that integrated them," says Roeck. Following the introduction of data-engine-enabled hearing instruments, Phonak rose in rank to number three in the world.

Phonak's most recent experience has been with OptimoDE, and the resulting chip is in working silicon. "The chosen DSP core by itself could not have fulfilled our targets of an ultra-low-power design," says Roeck.

Phonak's DSP system architecture analysis and its previous experience with data-engine-enabled technology had taught the company an important lesson: only sufficiently large functional blocks, such as entire FFTs, should be offloaded from the main DSP core into a dedicated data engine to achieve the performance and power consumption goals. The resulting loose coupling further helps the system meet those goals. An optimized, narrow interface (in this case, a double-buffered RAM) was also needed to reach the ultra-low power consumption goals.
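Double buffering (ping-pong buffering) is what lets such a narrow interface work: the main core fills one RAM bank while the data engine processes the other, and the two swap at block boundaries, so they hand off once per block instead of once per sample. The article does not describe Phonak's actual handshake; the Python class below is a generic, hypothetical sketch of the technique, with a method call standing in for whatever flag or interrupt the hardware uses.

```python
from typing import List, Sequence


class PingPongBuffer:
    """Two equal RAM banks shared by a producer (e.g. the main DSP core) and a
    consumer (e.g. a data engine). Each side owns one bank at a time, and
    ownership flips only at block boundaries."""

    def __init__(self, block_size: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self._banks: List[List[int]] = [[0] * block_size, [0] * block_size]
        self._write_bank = 0  # index of the bank the producer is filling

    def write_block(self, samples: Sequence[int]) -> None:
        """Producer fills its bank with one complete block."""
        bank = self._banks[self._write_bank]
        if len(samples) != len(bank):
            raise ValueError("block must exactly fill one bank")
        bank[:] = samples

    def swap(self) -> None:
        """Single handoff per block: the filled bank goes to the consumer."""
        self._write_bank ^= 1

    def read_block(self) -> List[int]:
        """Consumer reads the bank the producer is not writing."""
        return list(self._banks[self._write_bank ^ 1])


if __name__ == "__main__":
    buf = PingPongBuffer(block_size=4)
    buf.write_block([1, 2, 3, 4])
    buf.swap()
    buf.write_block([5, 6, 7, 8])  # producer fills the next block...
    print(buf.read_block())        # ...while the consumer reads [1, 2, 3, 4]
```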

OptimoDE was used in this way to attach a dedicated FFT engine (a 128-point FFT/inverse FFT in about 220 cycles) to the principal DSP core.
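The quoted cycle count implies substantial parallel hardware. Assuming a plain radix-2 algorithm, which the article does not state, a 128-point FFT needs (128/2) × log2(128) = 64 × 7 = 448 butterflies, so finishing in about 220 cycles means completing roughly two butterflies per cycle on average. A different radix or a real-input formulation would change the count; the function name below is illustrative.

```python
def radix2_butterflies(n: int) -> int:
    """Butterflies in an n-point radix-2 FFT: (n / 2) * log2(n)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return (n // 2) * stages


if __name__ == "__main__":
    n_points = 128
    cycle_count = 220  # "ca. 220 cycles", as quoted for the Phonak FFT engine
    butterflies = radix2_butterflies(n_points)
    print(f"{butterflies} butterflies, "
          f"{butterflies / cycle_count:.2f} per cycle on average")
```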

OptimoDE was also used to implement a proprietary sample-rate processing engine. A dedicated time-domain core (TDC) was added to the principal DSP core. "Both HW accelerators (TDC and FFT engine) worked out fully as expected," says Roeck.

Phonak specified the engines at a functional level in terms of power consumption, gate area, functionality and quality, but ARM's engineers in Leuven, Belgium, provided the engineering services to design and implement them.

Phonak's previous experience with the technology helped considerably in speeding the design process. Numerous low-power optimization techniques, many of them already integrated into OptimoDE, were applied to the problem, and several rounds of power simulation over the entire DSP system were required. A close, long-standing working relationship among Phonak, ARM's service engineering team and the design house was also crucial to meeting the partially conflicting targets of functionality, gate area and power consumption. The key low-power result was an impressive 0.013 mW/MOPS, according to ARM.

"At the conclusion of this process, the power consumption, gate area, functionality and quality (that is, being right the first time) goals were within specifications and our expectations," says Roeck. "ARM's Leuven engineering service team performed a superb job."
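The 0.013 mW/MOPS figure is power per unit of throughput, which reduces to energy per operation. The conversion below is plain unit arithmetic; the 50 MOPS workload in the last line is a hypothetical value chosen only to show the scaling, since the article does not give Phonak's workload or define what counts as one operation.

```latex
\begin{aligned}
1\,\tfrac{\text{mW}}{\text{MOPS}}
  &= \frac{10^{-3}\,\text{J/s}}{10^{6}\,\text{op/s}}
   = 10^{-9}\,\text{J/op} = 1\,\text{nJ/op},\\
0.013\,\tfrac{\text{mW}}{\text{MOPS}}
  &= 0.013\,\text{nJ/op} = 13\,\text{pJ/op},\\
P_{50\,\text{MOPS}}
  &= 0.013\,\tfrac{\text{mW}}{\text{MOPS}} \times 50\,\text{MOPS} = 0.65\,\text{mW}.
\end{aligned}
```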

Case History #2—Fewer Gates Deliver Better Overall Performance
From a corporate size and strategy perspective, National Semiconductor and Phonak are quite different. Phonak is a relatively small company that targets a niche market. National Semiconductor's sales are much larger, and it offers a very wide range of analog and digital products.

But they have one thing in common. Just as Phonak's market ranking climbed after it adopted data-engine technology, so did National's in a product line where data-engine technology was implemented.

A DECT phone chip designed in part with Adelante's technology, before ARM acquired Adelante, propelled National to second place in a market where it had not previously been in the top 10. Beyond this single product's success, National also considers data-engine technology a contributor to its overall product strategy.

Ahmad Bahai, chief technology officer of National's Wireless and Information Appliance Group, does not give data-engine technology all the credit for lifting the SC14408 family of CMOS chips to market prominence; the radio design and other optimizations were also new. But the technology did play an important role in the design of a very low-power processor that executes several algorithms required for echo cancellation, tone detection and equalization, he says.
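The article does not say how National's engine performs tone detection. One widely used technique is the Goertzel algorithm, which computes a single DFT bin with a second-order recurrence and is a common fit for DTMF-style detection. The sketch below is that generic algorithm, not National's implementation; the 8 kHz sample rate, 205-sample block and test frequencies are assumptions.

```python
import math
from typing import Sequence


def goertzel_power(samples: Sequence[float], sample_rate: float,
                   target_hz: float) -> float:
    """Squared magnitude of the DFT bin nearest target_hz, computed with the
    Goertzel recurrence (one real multiply and two adds per sample)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # |X[k]|^2 from the last two recurrence states
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


if __name__ == "__main__":
    fs = 8000.0  # assumed telephony sample rate
    n = 205      # assumed block length, a common choice for DTMF detection
    tone = [math.sin(2 * math.pi * 697.0 * i / fs) for i in range(n)]
    for f in (697.0, 1209.0):
        print(f"{f:6.0f} Hz bin power: {goertzel_power(tone, fs, f):10.2f}")
```

Because each bin costs only a multiply and two adds per sample, a small, fixed datapath can watch a handful of tones cheaply, which is the kind of narrow, related workload the text describes.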

Competing products typically use an off-the-shelf DSP for these tasks, so using a data engine optimized for a few related algorithms (but still programmable) provided a big advantage in both board space and power savings. National's project preceded ARM's acquisition of Adelante, and, like Phonak's, National's design does not use an ARM microprocessor core.

Although there is just one data-engine core in the DECT chip, Bahai sees a trend toward developing multiple data engines for National products. The primary reasons for this strategy are the advantages of the data-engine core itself, but the strategy also leverages National's analog expertise: for optimal performance, each data-engine core needs its own optimal voltage, and power management is one of National's core competencies.
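Why per-engine supply voltage matters follows from the first-order model of CMOS switching power, P = αCV²f: power scales with the square of the supply voltage. The sketch below applies that textbook model; the voltages and other values are hypothetical, leakage is ignored, and the function names are illustrative.

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    vdd_v: float, freq_hz: float) -> float:
    """First-order CMOS switching power, P = alpha * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def relative_power(v_new: float, v_ref: float,
                   f_new: float, f_ref: float) -> float:
    """Power at (v_new, f_new) as a fraction of power at (v_ref, f_ref),
    holding activity factor and switched capacitance constant."""
    return (v_new / v_ref) ** 2 * (f_new / f_ref)


if __name__ == "__main__":
    # Hypothetical example: an engine that meets its cycle budget on a
    # lower supply than the rest of the chip, at the same clock.
    print(f"{relative_power(0.9, 1.2, 1.0, 1.0):.4f}")
```

In this example, dropping one engine from 1.2 V to 0.9 V at the same clock cuts its switching power to about 56% of the original. Lowering the voltage also lowers the maximum clock a circuit can reach, so this pays off only for an engine with slack in its cycle budget.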

A key advantage of OptimoDE technology is the speed with which cores can be reconfigured using the ARM tools. In fact, Bahai refers to the data-engine cores as object-oriented cores because, once a core is developed, it can be reprogrammed with the OptimoDE compiler to handle a similar algorithm. In National's design flow, the new algorithm is programmed in MATLAB, debugged in an object-oriented language and then passed to the compiler, which creates a binary file for the design.


About the Author
Contributing writer Jack Shandle is a former chief editor of both Electronic Design magazine and ChipCenter.com. He holds a BSEE degree and has written hundreds of articles on all aspects of the electronics OEM industry. Jack is president of eContentWorks, a consultancy that creates high-value content for publishers, eOEM corporations, and industry associations. His email address is jshandle@earthlink.net.




