DSP Meets FPGA: Is Massive Parallelism Enough?

Jack Shandle

11/25/2003 12:00 AM EST


The wireless communications technology explosion came at a near-perfect time for FPGA companies with access to leading-edge fabrication technology. Reductions in FPGA feature size to 0.18 micron and below made more gates available and made each gate less expensive. Tighter geometries also gave a significant performance boost, and lower operating voltages made the chips less power hungry.

That auspicious sequence of events positioned companies such as Altera and Xilinx to take on the compute-intensive, algorithm-specific tasks required in technologies such as W-CDMA and in entertainment-oriented applications such as MPEG over a wireless link.

Although extreme number-crunching power is useful in many applications, it is particularly well matched to communications generally—and wireless communications in particular.

Cell phone infrastructure applications illustrate the skyrocketing demands being placed on processors (Table 1). Across the board in virtually every signal-processing application, the story is much the same.

| | 2G | 2.5G | 3G |
|---|---|---|---|
| Standards | GSM, DSC1800, PCS1900, IS-95B, IS-54B, IS-136, PDC | GPRS, HCSD, IS-95C, IS-136+, IS-136-HS, Compact EDGE | 3GPP-DS-TDD, 2GPP-MC, ARIV-W-CDMA, IS-2000 CDMA, OS-2 - DCMA, IS-95-HDR |
| Bandwidth | 9-13 Kbps (Narrowband Circuit Voice) | 64-384 Kbps | 384-2000+ Kbps (Wideband Packet Data) |
| Processor Performance | ~100 MIPS | ~10,000 MIPS | ~100,000 MIPS |

Table 1:  Standards and data drive wireless processing requirements higher

It is not hard to see how the very configurability that is the hallmark of FPGAs makes them ideal for customized but reconfigurable logic that can execute specific, compute-intensive algorithms, especially ones that benefit from massively parallel operations. Since standard DSP processors have fixed architectures, hardwiring and massive parallelism are not their strong suit, except in the case of configurable DSPs—and even there the advantages come at a price (read Configuring Success, Minimizing Risk: Upstart DSP Architectures Aim at Elegance, Ease of Use by Jack Shandle for more information).

Communications applications are particularly amenable to remarkable throughput boosts because they typically have three levels of parallelism, says Jeff Bier, president of the DSP technology consulting firm Berkeley Design Technology (BDTI). First, multiple channels are likely and similar processing takes place in each channel. Inside a channel, tasks tend to be done in a parallel, pipelined fashion, and, inside the individual tasks, the same calculations are repeated over and over again. A filter, for example, requires a long string of multiply-accumulate operations, Bier notes, which can be executed in parallel if the hardware is available to do so.
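Bier's filter example can be made concrete with a short sketch. The Python below is illustrative only; the function names, tap values, and channel data are assumptions chosen for the example, not anything taken from BDTI. It shows two of the levels of parallelism described above: each output sample is a string of independent multiply-accumulate (MAC) operations, and each channel repeats the same processing independently of the others.

```python
# Illustrative sketch only: function names, tap values, and sizes are assumptions.

def fir_output(taps, window):
    """One output sample: len(taps) multiply-accumulate (MAC) operations."""
    acc = 0
    for coeff, sample in zip(taps, window):
        acc += coeff * sample  # the products are independent of one another
    return acc


def fir_filter(taps, samples):
    """Direct-form FIR for one channel, emitting outputs once the window is full."""
    n = len(taps)
    outputs = []
    for i in range(n - 1, len(samples)):
        window = samples[i - n + 1:i + 1][::-1]  # newest sample first
        outputs.append(fir_output(taps, window))
    return outputs


def filter_channels(taps, channels):
    """Channel-level parallelism: each channel is processed independently."""
    return [fir_filter(taps, channel) for channel in channels]


def serial_mac_count(num_taps, num_outputs, num_channels):
    """MACs a single-MAC processor must execute one after another."""
    return num_taps * num_outputs * num_channels


taps = [1, 2, 3, 4]
channels = [[1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
outputs = filter_channels(taps, channels)
total_macs = serial_mac_count(len(taps), len(outputs[0]), len(channels))
print(outputs, total_macs)
```

A processor with one MAC unit works through every one of those operations in sequence. Hardware with a multiplier per tap, replicated per channel, could in principle produce one output per channel per clock, which is the kind of throughput gain the paragraph above describes.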

BDTI recently benchmarked representative FPGAs and DSPs for this type of application. According to its report, FPGAs for DSP, the new generation of DSP-enhanced FPGAs delivered more than an order of magnitude better throughput in a specific, telecom-oriented application benchmark (refer to www.bdti.com for details on BDTI's report).

A Multi-Dimensional Design Decision
But is massive parallelism enough?

A system is more than a single algorithm, or even a collection of algorithms, and performance is not the only measure of design feasibility and project success. Both FPGA and DSP companies are well aware of this basic truth and are working hard to remove their technology's weaknesses and build on its strengths.

The availability of sophisticated development tools is always a key factor in making a technology switch. In the case of FPGAs, the migration also requires designers to adopt a new design mindset. Although it is something of an oversimplification, FPGA design can be thought of as an exercise in hardware (chip) design, while DSPs, with their fixed architecture, are much more oriented toward software implementation.

Truth be told, both the DSP and FPGA camps recognize that FPGAs are best suited as DSP coprocessors when a two-chip solution to the processing equation is feasible. That still leaves the design team a lot of room for creativity and anxiety. The key design choice is where and when to use FPGAs as coprocessors. Relatively new figures of merit are being put forward to help answer that question.

FPGAs: The Contender
FPGAs have a number of advantages. Their architectural flexibility allows designers to adjust data widths to accommodate different requirements within the same algorithm and provide very wide on-chip memory bandwidth. The technology had its genesis in field programmability, which means the gates can be reused for multiple tasks—and in some cases even reprogrammed in the field to catch up with rapidly changing standards.
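The point about adjustable data widths can be illustrated with the standard worst-case bit-growth rule for a multiply-accumulate chain. The sketch below is a generic calculation, not a description of any particular FPGA or DSP; the function names and the 64-term example are assumptions. It computes how wide each stage needs to be, which is the width an FPGA designer would be free to give it.

```python
# Illustrative sketch only: generic worst-case bit growth for signed operands.

def product_bits(a_bits, b_bits):
    """Bits needed for the full-precision product of two signed operands."""
    return a_bits + b_bits


def accumulator_bits(a_bits, b_bits, num_terms):
    """Bits needed to sum num_terms such products with no overflow."""
    growth = (num_terms - 1).bit_length()  # equals ceil(log2(num_terms))
    return product_bits(a_bits, b_bits) + growth


def fits_signed(value, bits):
    """True if value is representable in two's complement with this many bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def worst_case_sum(a_bits, b_bits, num_terms):
    """Largest possible sum: every product is (most negative) * (most negative)."""
    return num_terms * (1 << (a_bits - 1)) * (1 << (b_bits - 1))


narrow = accumulator_bits(4, 4, 64)    # small operands, 64 accumulated terms
wide = accumulator_bits(16, 16, 64)    # larger operands, same number of terms
print(narrow, wide)
```

With these assumed inputs, the narrow path needs far fewer accumulator bits than the wide one. Configurable logic can build each stage at the width it actually requires instead of one width for everything.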

Any technology has its liabilities and a design engineer conducting a general overview of FPGAs as an alternative to DSPs would probably find the following aspects worth a thorough consideration in the context of the specific system design at hand.

Designing a signal processing engine from scratch is a daunting task, so FPGA companies have added embedded MAC units and memory blocks that need only be connected to other logic. But this efficiency reduces generality: what works for one task may not work for another. Power efficiency and price per unit are getting better but for the most part still lag DSPs, although system cost analyses can serve up some unexpected price/performance results.

In particular, Xilinx's Virtex-II family and Altera's Stratix family both deliver not just flexibility but also specific DSP enhancements. Both have hardwired, on-chip multipliers to speed up execution of the multiply-accumulate operations that are the hallmark of signal-processing algorithms.

Hardwired processing modules have two advantages over implementing the same function with programmable gates. Energy consumption is reduced and more performance can be wrung out of a much smaller number of gates.

Altera and Xilinx are also offering development tools and libraries including intellectual property, library blocks for common DSP functions, and interfaces to The MathWorks' Simulink DSP software development tools.

There are challenges, however, and perhaps the most serious is the design cycle itself. Industry leaders Xilinx and Altera are working hard to correct this problem, but their DSP development tools aren't as mature as those of the DSP processor companies, who, after all, have about a 15-year head start.

Moreover, the FPGA design flow is unfamiliar to veteran DSP design engineers. In its FPGAs for DSP report, for example, BDTI found evidence that even a relatively modest DSP function can take much longer to develop and optimize for an FPGA than the same function takes to code and optimize for a DSP. One source told BDTI that it can take six man-months to develop an optimal Fast Fourier Transform (FFT) for an FPGA, compared to the week it typically takes BDTI to write and optimize FFT code for a high-end DSP.

DSPs: Still Plenty to Offer
Just as the difficulty of development is one of the primary drawbacks of using FPGAs in signal processing applications, the relative ease of development (for seasoned DSP designers, at least) is a primary strength of DSP processors. Even a cursory inspection of development tools and libraries on the Web sites of major DSP companies reveals a vast selection of software that makes design easier and faster.

DSPs are relatively easy to program and compilers are getting better and better. The significant exception to this rule is at the high end, where architectural innovations that add parallelism to boost performance make programming harder for assembly-language programmers. Similarly, features such as deeper pipelines and running multiple instructions in parallel make it more difficult for compilers to generate efficient assembly code.

The DSP processor development infrastructure available for MCU-like tasks is, in general, lagging behind that of dedicated MCUs and cores such as the ARM7 and ARM9 families.

This is one reason for the emergence of a dual processor—DSP/MCU—architecture option in ASICs. But ASICs are not an option for every design team. Communication between the two processor cores can be a challenge to be sure but it is offset in some design teams' minds by the ability to use ARM development tools for MCU tasks and a DSP core for signal-processing tasks.

In terms of raw performance, DSPs continue to make substantial gains but will inevitably fall short of the performance bar set by custom-wired algorithms in FPGAs. At the high end of the performance scale, for example, Texas Instruments' TMS320C6000 family delivers from 1200 to 5760 MIPS for fixed-point and 600 to 1350 MFLOPS for floating-point operations. The platform is optimized for broadband infrastructure, performance audio, and imaging applications.

TI and its archrival, Analog Devices (ADI), have also made architectural changes to give designers more flexibility in dealing with application-specific challenges. ADI's latest TigerSHARC architecture, for example, can handle 8-, 16-, and 32-bit fixed-point data as well as 32-/40-bit floating-point data. Because the programmer is not bound to a single format, application data can be handled efficiently and complex computations get enough dynamic range.

Proof of the Pudding
Since the relative strengths of FPGAs and DSPs are more or less important depending on the application, benchmarking is an exercise in thoughtful selection of the application. BDTI, for example, chose a full application that resembles a real-world one: N channels of a simplified OFDM (Orthogonal Frequency Division Multiplexing) receiver.

The benchmark implementers' dual goals were to maximize the number of channels and minimize the cost per channel. Algorithms range from table look-ups to MAC-intensive transforms; data sizes range from 4 to 16 bits; data rates range from 40 to 320 MB/s; and data includes real and complex numbers.

The benchmark's data path included an IQ demodulator, a FIR filter, a Fast Fourier Transform, a slicer, and a Viterbi decoder. More details on the benchmark can be found on BDTI's site.

A Motorola MSC8101 DSP (based on the StarCore SC140 core) running on a 300 MHz clock and priced at $116 represented DSP technology. Two members of Altera's Stratix family represented FPGAs. The Stratix 1S20-6 and the 1S80-6 are priced at $325 and $3480 respectively.

Despite their higher cost, the FPGAs proved far more cost efficient. In terms of the number of channels of the application that could be supported by a single chip, the DSP was found to handle far less than one. The two Altera FPGAs could handle about 10 and 50 channels respectively. In the critical cost-per-channel metric, the DSP was estimated to cost about $500 per channel (more than one DSP was required for the application). The two FPGAs came in closer to $10 and $50 per channel.
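The cost-per-channel metric is simple division, sketched below with the rounded prices and channel counts quoted in this article. The helper names are assumptions. Because the channel counts are given only approximately, the division will not reproduce the rounded per-channel figures above exactly; BDTI's report has the measured values.

```python
# Illustrative arithmetic only: inputs are the rounded figures quoted in the article.

def cost_per_channel(chip_price, channels_per_chip):
    """Dollars of silicon needed to support one channel."""
    return chip_price / channels_per_chip


def channels_per_chip(chip_price, price_per_channel):
    """Inverse view: how many channels one chip supports at a given cost/channel."""
    return chip_price / price_per_channel


# DSP: $116 per chip at roughly $500 per channel implies well under one channel per chip.
dsp_channels = channels_per_chip(116, 500)
dsp_chips_per_channel = 1 / dsp_channels

# FPGAs: list price divided by the approximate channel counts (about 10 and about 50).
fpga_small = cost_per_channel(325, 10)
fpga_large = cost_per_channel(3480, 50)

print(round(dsp_channels, 2), round(dsp_chips_per_channel, 1))
print(fpga_small, fpga_large)
```

On these rounded inputs, one channel would take more than four DSPs, while either FPGA comes out at tens of dollars per channel. That is roughly an order of magnitude below the DSP figure, even though the FPGAs cost far more per chip.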

Up-Front Design Decisions
The technologies and combinations of technologies that can be employed by a design team range from the blazing performance of a dedicated ASIC that executes just one function to the flexibility of a common MCU, which admittedly lacks the processing power and price/performance metrics for most advanced signal processing applications. Figure 1 shows Altera's view of the DSP/FPGA solutions universe. It places special emphasis on coprocessing, says Brian Jentz, Altera's DSP Marketing Manager.


Figure 1:  Using an FPGA as a coprocessor rates special attention

The area of most interest in Figure 1 is described by Altera as "the flexibility zone." It also defines the "profitability zone" for FPGA vendors and the "uncertainty zone" for design teams. The uncertainty stems from the fact that there are at least three solutions that could be the right one in terms of optimizing cost, time to market, and performance.

In Altera's case, it offers a single-chip solution as long as its Nios soft processor core can handle the control activities. While flexible and relatively easy to implement, the Nios solution ranks fairly low on the performance axis. Similarly, the highest-performance option (the algorithm programmed into FPGA hardware) lacks flexibility, which increases design risk in most cases.

Since the clear choice in most cases is to employ an FPGA as a coprocessor, design teams should pay very close attention to the chip-to-chip interface. Altera, for example, uses Texas Instruments' external memory interface (EMIF) to communicate with TI parts. Similarly, it supports Motorola's Eureka bus and Analog Devices' Linkport bus, but presently does not have interfaces to DSPs from other vendors.

Although built-in bus support is a plus for design teams, the configurability of FPGAs extends to I/O as well as logic functions and designers can implement bus interfaces to any DSP architecture.

"The trend is good in terms of interfacing between FPGAs and DSPs," Altera's Jentz says. "In the future, DSPs will move toward standard high-speed serial interfaces and we already have support there."

Texas Instruments espouses a roughly similar perspective. Leon Adams, TI's Worldwide Manager of DSP product marketing, says that "in most cases, FPGA technology is a complement, not a competitor, to programmable DSPs in high-performance, real-time signal processing systems."

Texas Instruments and Xilinx are working together to bring the FPGA coprocessor good news to the design community, Adams says.

While DSPs are the classic technology to handle real-time signal-processing applications, says Adams, FPGAs can add a valuable new dimension in performance. In particular, FPGA coprocessors complement DSPs in specific areas such as system logic muxing and consolidation, new peripheral or bus interface configurations, as well as performance acceleration in the signal-processing chain.

For the design team charged with finding an optimal solution to a signal-processing application, the choices can be bewildering. One approach to making a preliminary decision about using an FPGA coprocessor is to compare your application to ones the FPGA companies consider appropriate.

In the realm of wireless communications, 2.5G EDGE equalization, 3G baseband processing, and 3G RF linearization are all prime candidates for coprocessing. In the consumer arena, broadcast plants and MPEG-2 and MPEG-4 processing are likely candidates. Computer and storage applications would include data analysis and routing engines as well as digital imaging. Even wireline communications have appropriate applications, including encryption, framers, and traffic management.


About the Author
Contributing writer Jack Shandle is a former chief editor of both Electronic Design magazine and ChipCenter.com. He holds a BSEE degree and has written hundreds of articles on all aspects of the electronics OEM industry. Jack is president of eContentWorks, a consultancy that creates high-value content for publishers, eOEM corporations, and industry associations. His email address is jshandle@earthlink.net.

