United Business Media EE Times



DSP Architectures Reach for Greater Parallelism

Meeting the demand for high-speed and sophisticated digital signal processors will require clever coding, advanced compilers, and increasingly complex component constructs.

by Emmanuel Roy and David Crawford


Over the last few years, performance requirements for the digital signal processors (DSPs) used in telecommunication systems have increased significantly. Factors driving the requirements include the development of new communication technologies and the need to reduce system costs. These new technologies include high-end telephony applications such as third-generation mobile communication and Voice over Internet Protocol. The migration from narrowband to broadband communication has increased DSP performance requirements by an order of magnitude or greater. Additionally, the push to lower system costs has forced reductions in the number of DSPs per system board while still maintaining or improving the quality of service for the system. As a result, designs must incorporate more channels (voice or data processing) onto a single device.

Figure 1: Emerging DSP architectures enhance the capability for dual-parallel processing through a variety of strategies including (a) the dual single-ALU core device, (b) the core-plus-coprocessor device, and (c) the dual-ALU core device.

These increased DSP demands will require more than simply improving manufacturing processes. Instead, designers must continue to develop entirely new DSP architectures, including those able to perform multiple arithmetic operations in parallel. Implementing this parallel capability involves adding multiple arithmetic logic units (ALUs) to a DSP and choosing from various design strategies. Of the options available, the DSP industry today uses two relatively straightforward approaches: take an existing, well-proven DSP core and duplicate it, creating a dual-core device; or design a special-purpose coprocessor for specific applications, then incorporate it with an existing, well-proven DSP core.

To duality and beyond

The dual-core approach adds an extra DSP core (and therefore an extra ALU) to the device. Each DSP core contains its own dedicated internal memory, as well as some memory accessible to both cores. Interconnections between the cores permit data exchange but often limit flexibility. Therefore, minimizing data exchanges where possible helps to avoid communication bottlenecks. In general, running independent tasks on each core offers the most straightforward way of utilizing a dual-core device and requires little or no intertask communication between cores. The feasibility of this approach does depend on the application being implemented, but designers should work it in when they can, as it offers considerable advantages: each core effectively runs its own self-contained code that requires little or no modification (see Figure 1a).

Figure 2: The strategies for multi-stream parallel processing include: (a) a multiple single-ALU core device, (b) a twin dual-ALU core device, and (c) a multi-ALU core device. The increasingly complex designs rely on efficient algorithm coding to maximize effectiveness.

Depending on the coprocessor design and functionality, the coprocessor approach may also add another ALU to the device, allowing specific operations to execute in parallel with the DSP core. Independent of the core, the coprocessor executes a task on its own and the program code running on the core initiates coprocessor utilization. This approach does, however, require the implementation of certain software modifications (see Figure 1b).

In conjunction with these approaches, designers have increased the performance of a single DSP core device by incorporating multiple ALUs into the core. In this case, a single program controls both ALUs (highlighting the interdependence of the two units) and provides parallelism without the need to duplicate the program and data memory spaces. Exploiting the parallel capabilities of the dual-ALU core requires software modification to expose opportunities for parallelism within the corresponding algorithm (see Figure 1c).

Expanding on the fundamental concepts described (dual single-ALU cores or single dual-ALU cores), a designer can introduce much-needed complexity onto a DSP device. The resultant strategies include: multiple single-ALU cores with four to eight DSP cores (each with a single ALU), twin dual-ALU cores, and multi-ALU cores with four to eight ALUs in a single core (see Figures 2a through 2c). Not surprisingly, each of these approaches to parallelism offers advantages as well as disadvantages. Such factors as time to market, suitability for targeted applications, power consumption, and ease of programming help to influence the software engineer's choice of algorithms and coding strategies.

A proliferation of cores

In general, a multiple single-ALU core device presents a relatively simple design and production challenge, since it mainly involves the duplication of existing cores. Each core incorporates an ALU, program control logic (for instance, a program decoder, address generators, and interrupt mechanism), dedicated internal memory for program code and data, and other features that would normally appear on a single-core DSP device. Note, however, that the device must have a certain amount of core-to-core interconnectivity designed into it to allow some level of communication between cores. Note also that the cores potentially compete with each other for access to peripherals. Designers may, therefore, have to add more peripherals to the device to reduce the likelihood of signal traffic congestion.
Listing 1

Self-contained cores run more or less independently of each other, making this type of device ideally suited to applications consisting of multiple subtasks that need little or no interdependency. The subtasks assigned to different cores rely on minimal intercore communication. One example is the processing of multiple channels of speech coding: the cores execute the same program on different channels of data. Note, however, that for maximum performance, each core requires its own local copy of the program code as well as all data tables associated with the active algorithm. This redundancy results in inefficient memory usage.

For some applications, it isn't possible to clearly identify subtasks available to run in parallel; the subtasks need to run sequentially. If the opportunity for parallelism exists at all in such cases, it must reside within the subtasks. Achieving that parallelism then requires the software engineers to spread the execution of each subtask over several cores. However, transferring data between the cores may incur significant communication overhead. Furthermore, design considerations must include synchronization of the cores. These issues in general require the software developer to implement extensive modifications in the program code.

The multi-ALU core device

A DSP device based on a single multi-ALU core needs less duplication of hardware logic than a multiple single-ALU core device does. For instance, neither the program decoder, the address generators, nor the interrupt mechanisms need to be duplicated, which reduces both die size and power consumption. Furthermore, each algorithm associated with this type of design strategy requires only one copy of the program code or data tables, while data paths maintain synchronization automatically. In a single multi-ALU device the core can run only one program at a time. As mentioned, the code therefore requires modification to take advantage of the opportunities that exist for parallelism within each of the application's tasks.
Listing 2

Code running on multiple single-ALU core devices also requires modification when the application cannot be split conveniently into independent subtasks. In a multi-ALU core, by contrast, the communication of data between ALUs is straightforward; the design confines all data to a single core and passes data via registers. If, for instance, the device has an orthogonal instruction set, data held in one register for a particular ALU can simply pass to another ALU as needed. A multiple-ALU device adds another important design option to the arsenal of choices available to the DSP designer attempting to accommodate additional complexity on a single die.

Progress in modifying application software to exploit parallelism in the algorithms, together with current tool technology, has demonstrated that ALU loading can be maximized automatically by exploiting opportunities for parallelism found within the C/C++ or assembly code. Debugging technology has also made significant progress in recent years, easing the debugging of parallel instructions. Meanwhile, recent efforts devoted to improving C/C++ compiler technology promise to significantly enhance parallelization of the code and thus device performance. Tool suites include a C/C++ compiler optimizer, an assembly optimizer, or both, which parallelize the assembly instructions.

Linear code uses only one ALU regardless of the number of ALUs available on the DSP core and thus doesn't fully utilize the potential of the architecture. The C/C++ compiler or assembly optimizer technology currently available can easily parallelize such code as long as the linear code instructions are independent of each other. Sometimes, however, algorithmic data dependencies prevent the optimizer from parallelizing the code, which leaves some of the ALUs idle.
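The distinction between independent instructions and algorithmic data dependencies can be sketched in plain C. The two functions below are hypothetical illustrations, not taken from any particular tool suite: in the first, every iteration stands alone, so an optimizer is free to spread iterations across ALUs; in the second, each result feeds the next, so the iterations have to run in order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independent iterations: out[i] depends only on in[i], so several
   iterations can in principle be issued to different ALUs at once. */
void scale_offset(const int16_t *in, int32_t *out, size_t n,
                  int16_t gain, int32_t offset)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (int32_t)gain * in[i] + offset;
}

/* Loop-carried dependency: out[i] needs out[i-1] (halved), so iteration
   i cannot start until iteration i-1 has finished. */
void leaky_integrator(const int16_t *in, int32_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = in[i] + prev / 2;
        out[i] = prev;
    }
}
```

With inputs {4, 4, 4, 4}, the second function produces 4, 6, 7, 7: each value can only be computed once its predecessor is known.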

The ability to obtain automatic parallelization for multi-ALU core devices clearly stems from the progress in C/C++ compiler technology that has taken place over the past few years. For example, the StarCore C/C++ compiler uses techniques that reduce algorithmic dependencies. One technique the compiler optimizer uses, partial summation optimization, allows the optimizer to process a sequence and determine whether temporary variables (for example, DSP registers) can be used to parallelize the operation (see Listing 1).

By creating more temporary variables and allocating more register resources, the optimizer fully utilizes the four ALUs available in the architecture, thus reducing the cycle count by more than 70 percent when compared with one ALU alone (see Listing 2). Note that designers should use the partial summation method with care if they need to achieve bit-exactness (in speech coding standards, for example).
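As a rough sketch of the idea (not the compiler's actual output), the C below contrasts a linear accumulation with a four-way partial summation. The function names and the use of a saturating 32-bit add are assumptions made for illustration; saturation stands in for the kind of DSP arithmetic in which the order of summation can change the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating 32-bit add (assumed here as a model of DSP accumulation). */
static int32_t sat_add32(int32_t x, int32_t y)
{
    int64_t s = (int64_t)x + y;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Linear form: each add depends on the previous one, so a single
   accumulator keeps only one ALU busy. */
int32_t dot_linear_sat(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = sat_add32(acc, (int32_t)a[i] * b[i]);
    return acc;
}

/* Partial summation: four temporaries (one per ALU) accumulate
   independently and are combined at the end. */
int32_t dot_partial4_sat(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no handling of leftover terms */
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 = sat_add32(acc0, (int32_t)a[i]     * b[i]);
        acc1 = sat_add32(acc1, (int32_t)a[i + 1] * b[i + 1]);
        acc2 = sat_add32(acc2, (int32_t)a[i + 2] * b[i + 2]);
        acc3 = sat_add32(acc3, (int32_t)a[i + 3] * b[i + 3]);
    }
    return sat_add32(sat_add32(acc0, acc1), sat_add32(acc2, acc3));
}
```

For small inputs the two functions agree. With a = {32767, 32767, 32767, -32768} and every b equal to 32767, the linear form saturates on the third term and returns 1073774591, while the partial-sum form never saturates and returns 2147319811. That difference is the bit-exactness hazard in miniature.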

Illustrating the point

Listing 3
Obtaining maximum performance from multi-ALU cores requires in-depth knowledge of the algorithms that make up the overall application, in particular a clear understanding of the opportunities for parallelism. As discussed, the development tools provide some parallelization techniques; other techniques rely on the software developer. The developer can improve on the ability of the C/C++ compiler to exploit parallelism by applying knowledge of the algorithm to the structure of the C/C++ application code. For example, since filtering tasks occur frequently in digital signal processing, a multi-ALU core must be capable of processing filtering operations very efficiently. Equation 1 calculates the output of a finite impulse response (FIR) filter, where x(n-i) represents the input samples at time (n-i), w(i) represents the filter coefficients, y(n) represents the output samples, and N represents the number of filter taps (see Listing 3). In theory, a four-ALU core could calculate four multiply-and-accumulates (MACs) at a time. However, the partial summation approach suffers from two main disadvantages: it requires eight memory accesses every cycle with the loading of four coefficients and four input samples, and it fails to guarantee bit-exactness.
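From the symbol definitions given above, Equation 1 presumably takes the standard FIR convolution form. The index range 0 to N-1 is an assumption, since the text does not state it:

```latex
y(n) = \sum_{i=0}^{N-1} w(i)\, x(n-i)
```

Each term of the sum is one MAC with two operands, w(i) and x(n-i). Computing four terms of the same output in one cycle therefore needs four coefficients and four input samples, which is where the eight memory accesses per cycle come from.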

The eight memory accesses required consume a significant amount of data bus bandwidth. Furthermore, bit-exactness demands the preservation of the summation order of Equation 1, which precludes parallel computation of products for output y(n) using the partial summation technique. Therefore, rather than processing just one multiply-and-accumulate per cycle from Equation 1, the designer can use the multi-sample technique. This technique calculates four MACs in parallel, one for each output sample y(n), y(n+1), y(n+2), and y(n+3) as shown (see Equation Set 2, Listing 3).

The first four input samples x(n), x(n+1), x(n+2), and x(n+3) and the first filter coefficient, w(0), are loaded into the DSP registers in one cycle (see Equation Set 3, Listing 3). The first four multiplications then take place simultaneously on the four ALUs in one cycle, and the next cycle needs only two new values: the next filter coefficient, w(1), and the previous input sample, x(n-1). The three input samples needed for the second group of four multiplications remain in internal registers from the previous register loads. This process repeats for each of the N filter coefficient values until the filtering operation on these four output samples completes. Following the calculation of the first four output values, the calculations for outputs y(n+4), y(n+5), y(n+6), and y(n+7) can begin.

The number of cycles required for filtering an input sequence therefore decreases by a factor of about 4 compared to a single-ALU core, achieving optimal ALU loading. Note that this multi-sample technique requires only two memory loads per cycle. Furthermore, the order of summation for each output value maps identically to that of Equation 1, preserving bit-exactness. Since filtering commonly occurs in many DSP applications, the multi-sample technique offers a very useful and versatile method for achieving parallelism.
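The multi-sample scheme described above can be sketched in portable C. This is an illustration under stated assumptions rather than the authors' listing: the function names, the buffer layout (the input array carries ntaps-1 history samples ahead of x(0)), and the use of ordinary integer arithmetic in place of DSP MAC instructions are all hypothetical. The locals x0 to x3 model the four sample registers, and the inner loop performs the two loads per step that the text describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference FIR: one output at a time, summing i = 0 .. ntaps-1.
   Assumed layout: in[ntaps - 1 + n] holds x(n). */
void fir_reference(const int16_t *in, const int16_t *w, int32_t *y,
                   size_t num_out, size_t ntaps)
{
    for (size_t n = 0; n < num_out; n++) {
        int32_t acc = 0;
        for (size_t i = 0; i < ntaps; i++)
            acc += (int32_t)w[i] * in[ntaps - 1 + n - i];
        y[n] = acc;
    }
}

/* Multi-sample FIR: four outputs per pass, one MAC per output per step. */
void fir_multisample4(const int16_t *in, const int16_t *w, int32_t *y,
                      size_t num_out, size_t ntaps)
{
    assert(num_out % 4 == 0);  /* sketch only: no tail handling */
    for (size_t n = 0; n < num_out; n += 4) {
        size_t base = ntaps - 1 + n;          /* index of x(n) */
        int32_t y0 = 0, y1 = 0, y2 = 0, y3 = 0;
        /* Initial register load: x(n) .. x(n+3). */
        int16_t x0 = in[base], x1 = in[base + 1];
        int16_t x2 = in[base + 2], x3 = in[base + 3];
        for (size_t i = 0; i < ntaps; i++) {
            int16_t c = w[i];                 /* load 1: coefficient w(i) */
            y0 += (int32_t)c * x0;            /* w(i) * x(n   - i) */
            y1 += (int32_t)c * x1;            /* w(i) * x(n+1 - i) */
            y2 += (int32_t)c * x2;            /* w(i) * x(n+2 - i) */
            y3 += (int32_t)c * x3;            /* w(i) * x(n+3 - i) */
            /* Three samples stay in registers; only one new one is loaded. */
            x3 = x2; x2 = x1; x1 = x0;
            if (i + 1 < ntaps)
                x0 = in[base - (i + 1)];      /* load 2: x(n - i - 1) */
        }
        y[n] = y0; y[n + 1] = y1; y[n + 2] = y2; y[n + 3] = y3;
    }
}
```

Because each output still accumulates its products in the order i = 0 to ntaps-1, the per-output sequence of operations matches the reference loop. The same would hold if the `+=` were replaced by a saturating MAC, which is why the technique preserves bit-exactness.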

Currently, the evolution of DSP architectures balances the advantages and limitations of multiple single-ALU core devices versus multi-ALU core devices. Optimization features available in multi-ALU development tools reflect progress in C/C++ compiler technology and promise further improvements going forward. Ideally, the multi-sample technique will become increasingly transparent to the software engineer.

Clearly, DSP devices capable of performing multiple arithmetic operations in parallel will become more prevalent as applications continue to demand greater performance from the devices. Such an increase will require software and tool developers to apply detailed knowledge of DSP algorithms to obtain optimal parallelism in applications.


Emmanuel Roy and David Crawford are with the Wireless Infrastructure System Division of Motorola Semiconductor Products Sector. Their work encompasses DSP applications for the wireless infrastructure market.
Copyright © 2000 Integrated System Design Magazine
