Programmable Logic Devices (PLDs) have long been used as primary processors and co-processors in telecommunications. In these applications, the input data commonly arrive at audio data rates, and there are strict timing constraints, such as completing the calculations between successive input samples. With a Digital Signal Processor (DSP) such as the TI TMS320C6000 series, this allows only a few tens of thousands of instructions for the entire calculation. This instruction bandwidth limitation can be minimized by taking advantage of the multiple processing units in the C6000 (8 parallel units in the 620x/670x chips). However, creating the specialized pipelined code to take true advantage of this parallelism requires hand optimization of assembly language routines. Maintenance, re-usability, and implementation of this type of code can be troublesome and expensive at best. Additionally, the degree of parallelism (simultaneous executions) is relatively low.
Another alternative for high-bandwidth computations is to use a PLD as a co-processor which integrates the repetitive, speed critical portions of an algorithm into the PLD. With a PLD and Altera's new automated design software, the engineer has the ability to optimize system performance in ways not possible with a traditional DSP. The purpose of this white paper is to discuss the general issues of moving part, or all, of a DSP application onto a PLD using Altera's DSP Builder and SOPC Builder system software design tools.
Altera's Automated Design Suite
The Altera design software consists of three main components: Quartus II, SOPC Builder, and DSP Builder. Their relationship is shown in Figure 1. Collectively, these tools comprise an automated system development platform that provides a high level of design integration and flexibility. The engineer can now concentrate on the target design at the system level rather than at the level of HDL and logic construction.
Quartus II is the foundation of the design suite. It is the logic-level design tool that supports embedded processor software development, FPGA and CPLD design, synthesis, place-and-route, verification, and device programming. It performs the lower-level functions of producing a programmed PLD from the set of design files passed to it by SOPC Builder and DSP Builder. In broad analogy to general software development, Quartus II is the assembler and linker.
Figure 1. Altera Software Design Tools
Sitting on top of Quartus II are two packages that can be used separately or in conjunction, depending on the task at hand. These tools are equivalent to higher-level language compilers where HDL (Verilog or VHDL) is the underlying language. SOPC Builder and DSP Builder are analogous to visual designer environments like MS Visual C++, providing a large suite of automated design assistance. In fact, with SOPC Builder and DSP Builder, the designer does not need to be a VHDL or Verilog programmer. These packages are automated system generation tools in which the components of a hardware system are defined, interconnected, simulated, and verified, all without resorting to the underlying HDL. With a true point-and-click design method, the system architect can generate an entire system, simulate and verify it, and download it into a PLD, all from the desktop PC.
SOPC Builder is an automated system development tool that dramatically simplifies the task of creating high-performance system-on-a-programmable-chip (SOPC) designs, giving the engineer complete flexibility to pick and choose peripherals and their performance based on the needs of the design. The tool automates the system definition and integration phases of development. Using SOPC Builder, system designers can define a complete system, from hardware to software, within one tool and in a fraction of the time of traditional system-on-a-chip (SOC) designs.
To provide enhanced signal processing capabilities, DSP Builder uses The MathWorks' MATLAB and Simulink tools for signal processing system generation. Creating a DSP application in a PLD requires both high-level algorithm and HDL development tools. DSP Builder integrates these two functions by combining the algorithm development, software simulation, and verification capabilities of MATLAB/Simulink with the hardware synthesis and simulation of the Altera design software. DSP Builder allows system, algorithm, and hardware designers to share a common development platform that uses a drag-and-drop architecture. Components are selected from a large menu of options and placed on the Simulink workspace, where they are connected with the mouse. Parameters for a given component (for example, an analog-to-digital converter) are controlled by drop-down menus.
Medical and Industrial Applications
Signal processing in medical and industrial applications often has fundamental differences from the typical telecommunication application. In medical imaging applications, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), the input data is not a real-time data stream of successive bytes, but rather it is a large data block (an image) residing in memory or on disk. The task of the image processing chain is typically to tile the image into sub-blocks, perform a series of linear algebraic transforms on those blocks, and place the resultant data back on the disk. In addition, very large computational bandwidths are required when multiple images are compiled into 3D or coronal/sagittal views.
In many industrial applications, such as ultrasonic flaw detection for example, the data rate from the sensors vastly exceeds audio and may reach as high as 20 MSamples/sec. In other industrial applications, the sensor data rates can be much smaller, e.g., 100 kSamples/sec, but there may be multiple sensors. This drives the effective data rate into the radio-frequency regime. If the processing chain is complex, general-purpose DSPs often do not have the bandwidth necessary to meet real-time deadlines and processing must be done in an "offline" manner.
The core of the difficulty with DSPs in these applications comes in two parts. First, they are essentially serial machines, processing one element of the signal chain at a time. In the case of some high-end DSPs, a small number of instructions can be processed simultaneously giving a small degree of parallelization. However, the cost of these DSPs can be 10x that of a non-parallel version.
In a number of applications, multiple DSPs can be utilized to obtain true parallel processing, but the cost of hardware rises very quickly as the process is made parallel. Software for these systems becomes ever more complex and expensive as well, requiring a real-time operating system (RTOS) or very careful inter-processor communication schemes. Even then, the degree of true parallelism is relatively low and comes at a significant price in terms of design time, cost of goods, and time to market.
Gaining Speed with PLDs as Parallel Co-Processors
These issues have led many medical and industrial designers to use PLDs, such as the Altera Stratix family, to take advantage of the ability to convert a cascaded set of operations into a parallel FPGA structure that operates in several clock cycles at 200+ MHz. This is the central advantage of the PLD technology: the ability to speed up an algorithm by making the process truly parallel.
In a DSP application, engineers have few options to increase performance beyond writing pipelined assembly (if their DSP supports it) or replacing the DSP with a higher frequency model. With the Altera approach, the hardware, as well as the software, can be optimized simultaneously. The designer now has a three-dimensional optimization space available (see Figure 2), whereas previously code optimization and processor speed were the only choices. With the nearly exponential price-versus-performance curves for many hardware processors, designers were typically left with only one option: code optimization. As anyone who has had to optimize code knows, you can only go so far before you exhaust that avenue. The broad flexibility provided by Altera's new paradigm can help create systems that were previously impossible to design, either because of time and cost of goods, or because traditional DSPs could not handle the computational load. The engineer now has one more degree of freedom in the design process: hardware acceleration. This acceleration is accomplished by taking the parallelization process one step further than DSPs and making the algorithm massively parallel.
Figure 2. The New 3D Optimization Space Opened by Altera's Design Software
In addition to the obvious calculation speed increase, using PLDs as co-processors gives the designer increased flexibility in four important ways:
* Speed gain in the core algorithm relaxes deadline pressure on the remaining parts of the algorithm reducing the overall timing criticality of the system.
* High-performance DSP applications can quickly fill the resources of available DSPs leaving no room for expansion. With the PLD approach, additional algorithms or filters can be added to the existing system without resorting to extra hardware, or increasing the timing pressure on the system.
* Control hardware that traditionally exists external to the DSP can be easily integrated into the PLD and combined with the algorithm in a single hardware chip. This can save time and money in PC board development and has a great impact on the flexibility of the system.
* The effect of specification, performance, and availability changes made by other manufacturers is reduced because control is in the system architect's hands rather than a third party's.
To illustrate these issues, Figure 3 shows a C code fragment for a convolution operation such as might be used in image processing, or in a signal filter. This nested loop structure accounts for the vast majority of image and signal processing tasks.
With the Harvard architecture of DSPs, the multiply-accumulate (MAC) can be performed efficiently in a single instruction cycle. However, multiple MACs must be performed to complete the inner loop in Figure 3, and the branching operations of the for loops consume more instruction cycles than the MAC operations themselves.
Figure 3. C Code Segment Illustrating Repetitive Nature of Typical Signal Processing Task
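Figure 3 itself is not reproduced here, but a convolution fragment of the kind it describes can be sketched as follows. The variable names and indexing scheme are illustrative assumptions, not the figure's actual code; the inner loop is the MAC that dominates the cycle count:

```c
#include <stddef.h>

/* y[n] = sum over k of h[k] * x[n + k], for a kernel of klen taps
   evaluated at olen output positions (x must hold olen + klen - 1
   samples). The nested-loop structure mirrors the Figure 3 fragment. */
void convolve(const int *x, const int *h, int *y,
              size_t olen, size_t klen)
{
    for (size_t n = 0; n < olen; n++) {    /* outer loop: output samples */
        int acc = 0;
        for (size_t k = 0; k < klen; k++)  /* inner loop: one MAC per tap */
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}
```

On a serial DSP each inner-loop MAC, plus the loop branch overhead, consumes instruction cycles one after another; in the PLD the same taps can be computed in parallel hardware.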
For illustration purposes, we will display the clock cycle usage of a traditional DSP performing this convolution operation and compare it to the same calculation implemented on a Stratix chip using DSP Builder. Imagine unrolling a single iteration of the outer for loop (represented by the colored arrows in Figure 3) and placing the resulting operations in a serial DSP program. If we plot these calculations versus the clock of the system (calculation time), they are represented by the sequential line of arrows in Figure 4(a).
Figure 4. Comparison: Execution of a Convolution Algorithm in a DSP and a PLD
In the TMS320C6000 DSPs, the software designer can write optimized assembly code to pipeline instructions and data to the parallel logical units. This reduces the clock cycle usage as represented in Figure 4(b). Speed gains of up to 5X can be obtained by carrying out the optimization, either by hand or in the C compiler; the compiler can perform some of the parallelization, but the highest speed gains are found in hand-optimized assembly code. Writing this type of code is a specialized skill requiring detailed analysis of the algorithm to determine the manner in which the instructions and data must be interleaved into the instruction and data pipelines. If timing considerations change, the code may have to be modified or rewritten. A distinct drawback to this scenario is that the coding time and maintenance cost of this type of software is significant.
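As a rough C-level illustration of the restructuring such optimization performs (real pipelining is done in C6000 assembly; this sketch only hints at the idea of exposing independent operations), the inner MAC loop can be unrolled across independent accumulators so that parallel functional units have work to schedule together:

```c
#include <stddef.h>

/* Illustrative-only unrolled MAC loop: four independent accumulators
   break the serial dependency chain, exposing four multiplies per
   iteration that parallel units could, in principle, execute at once. */
int mac_unrolled(const int *x, const int *h, size_t klen)
{
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;   /* independent accumulators */
    size_t k = 0;
    for (; k + 4 <= klen; k += 4) {       /* four MACs per iteration */
        a0 += h[k]     * x[k];
        a1 += h[k + 1] * x[k + 1];
        a2 += h[k + 2] * x[k + 2];
        a3 += h[k + 3] * x[k + 3];
    }
    for (; k < klen; k++)                 /* remainder taps */
        a0 += h[k] * x[k];
    return a0 + a1 + a2 + a3;             /* combine partial sums */
}
```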
Using the DSP Builder software and implementing the algorithm on a PLD, the convolution filter can be constructed so that the filter operates in 1-2 clock cycles as shown in Figure 4(c). In many applications, this is the primary consideration: speed. (Note: In both systems, we have ignored the initial clock cycles required to load the data, i.e., fill the taps, before the calculation can be made.)
When designing with a PLD, it is important to understand a key trade-off that requires a change of perspective from traditional DSP design. To gain speed in a DSP system, the designer can make several modifications: 1) write more optimized code (usually assembly), 2) upgrade the DSP to a faster, more expensive model, or 3) add more DSPs. In a PLD design, the trade-offs are different. If the designer wants more speed (smaller execution times), then more LEs must be utilized to make the operation more parallel. Since each LE occupies space on the chip, the traditional way to express this trade-off is space vs. speed: more speed requires more space on the chip. The primary system trade-off then becomes chip size (number of LEs) versus cost. Altera provides a wide range of chip sizes and silicon technologies that allows the designer to select a performance/price point that works for their system.
Logic Elements (LEs) are the basic units in the FPGA architecture. Altera provides a wide range of chips with different LEs and support features.
The advantages of the PLD approach are four-fold:
* The speed gains over optimized assembly code in the DSP are a factor of 20 or more.
* The Altera design software, such as DSP Builder, is a graphical design tool that uses a drag-and-drop architecture. Thus, development time is significantly less than the writing of pipelined DSP code.
* If the filter requires a hardware control interface to external hardware, it can be easily implemented directly on the same PLD with the Altera software rather than writing code to interact with the interrupts or buses on a DSP.
* Modifications to the design and architecture are fast and easy to implement.
Design Decisions: PLD Co-processor or DSP
The speed advantage of using PLDs is evident from Figure 4. The process can be made parallel so that the ratio of computational throughput to the number of clock cycles used is quite high. In making the decision to take a project in the direction of PLD co-processors, there are a number of factors the designer must consider. We will attempt to illuminate the major concerns here.
The first design job is to segment the design into those tasks that will be placed in the co-processor, and those left in a DSP or other system microprocessor (we'll call it the master processor). As we shall see in the design example, even the master processor can sometimes be eliminated from the hardware design with the use of the NIOS soft processor.
When faced with the segmentation task, the best way to approach it is to divide the problem into two independent but related components: 1) the computational algorithm itself, and 2) the hardware control of that algorithm. While these are inter-dependent, they can be separated easily with the use of a simple flow chart.
Figure 5 shows a simplified example that illustrates the way to look at the segmentation process. In the flow chart, we see the implementation of a simple FIR filter that processes a high-data-rate, real-time signal, e.g., bandpass filtering a noisy sensor signal, and displays a derived parameter (in this case the signal power) on a CRT or LCD.
The upper block, labeled Filter Configuration, is where the tap coefficients are calculated based on the configuration of the system. For example, the user may choose filter type (Butterworth, Chebyshev, etc.), and cut-off frequencies. The cross-hatched arrow leading into the Filter Configuration block is the software control provided by the master processor.
Figure 5. Segmentation of a Real-Time FIR Filter Processor
In the second block, the system manages the data input, for example framing incoming data into manageable segments for the filter. Additionally, this data framing operation must also send a few hardware signals to the ADC such as enable and acknowledge. The white arrows in the Figure represent these hardware control points. In the Data Output block, hardware control is again required to format the output data and facilitate its final placement, for example, a DMA operation to external SRAM. In the final block, the power in the filtered signal is calculated by squaring the signal elements and summing. The power parameter is then displayed on a CRT or LCD.
Now let's segment the design into parts that will reside in the PLD and parts that remain with the master processor. The grey circular arrow in the background of Figure 5 represents the high-speed, repetitive portion of the algorithm. This is the segment of the algorithm that changes little in form throughout the system's usage. For example, it does not become an IIR filter based on simple selections of the user. It is also the most speed critical portion of the algorithm. This is the best candidate for transfer to a PLD based on speed requirements and its static configuration.
The final block has a dashed line diagonally across its face. This represents a possible division of the task into separate parts. Since we are simply displaying the power parameter, the timing pressure on this part of the algorithm is very low, on the order of a few tens of milliseconds. However, the calculation of the power can be computationally intensive because it involves squaring an array element by element and then summing the elements together. Whether this remains with the master processor or is placed in a PLD depends on a number of questions:
1. What is the load on the master processor? Can moving task(s) to the PLD free other master processor resources?
2. How much FPGA real estate (logic elements, or LEs) does the filter operation take? How many LEs will the power calculation take?
3. Are you using floating-point or integer arithmetic in the master processor? If integer math is used, then a scaling operation will be needed to prevent possible overflow caused by the squaring operation.
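The scaling concern in point 3 can be sketched in C as follows. The 16-bit sample type, the shift amount, and the function name are assumptions for illustration, not part of the reference design:

```c
#include <stdint.h>
#include <stddef.h>

#define POWER_SHIFT 4  /* illustrative scaling to guard the accumulator */

/* Signal power over an integer sample buffer: square each sample,
   shift the product down before summing so the 32-bit accumulator
   does not overflow on long buffers. */
uint32_t signal_power(const int16_t *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++) {
        int32_t sq = (int32_t)buf[i] * buf[i]; /* square: at most 2^30 */
        acc += (uint32_t)(sq >> POWER_SHIFT);  /* scale, then accumulate */
    }
    return acc;
}
```

With floating-point arithmetic in the master processor, the scaling step would be unnecessary; this is exactly the kind of consideration that tips the task toward one side of the segmentation or the other.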
When segmenting the design, designers will typically find that there are obvious portions of the algorithm that should be moved to the PLD, while others depend on a number of system issues. For more complex code, the best way to make this determination is to develop a higher-level language (such as C/C++ or MATLAB) model of the code. The compiler's profiling functions can be utilized to determine execution time and which portions of the algorithm are using the most CPU resources. Since MATLAB is required for DSP Builder, it is very convenient to use MATLAB for the algorithm simulation and to employ tools such as its graphical profiler.
For those unfamiliar with MATLAB, it is an interpretive language with many common features of C/C++ and can be learned in a short period of time. (See The MathWorks website: www.mathworks.com for more information on MATLAB and Simulink.)
Design Decisions: Computational Control with the NIOS Soft Processor
Often after the design has been segmented and the repetitive elements have been configured in an Altera PLD, there still exists a control function that must be implemented as we saw in the previous section. In that case, we left the filter coefficients and the final display algorithm in the master processor. A general purpose processor is better suited to these types of calculations for a couple of reasons:
1. They are usually based on input from a User Interface (UI) that is typically slow and subject to change.
2. Multiple decision trees must be implemented.
Altera offers an alternative to keeping a DSP in the system for control purposes with its NIOS soft processor. NIOS is a 16-bit or 32-bit microprocessor that can be custom designed with peripherals and control logic using the SOPC Builder software. The processor design is then converted into interconnected LEs by Quartus II and can be downloaded into the FPGA much like any other hardware design. Sometimes this concept takes a moment to take root because it is a radical departure from the norm; not only is the logic design downloadable to the PLD, but the entire microprocessor is also downloadable.
The NIOS also provides a set of unique features specifically designed for performance enhancement: custom instructions and custom peripherals. With these new software functions, designers can combine a complex sequence of hardware operations into a single NIOS function call. This hardware acceleration technology is a unique way to integrate system hardware directly into the software. Essentially, it adds custom-defined logic/hardware functionality to the processor. Custom instructions are simple operations that can be handled in the NIOS CPU registers; accordingly, a custom instruction is attached directly to the ALU of the NIOS. A simple example of a custom DSP instruction might be a MAC that also scales the result based on its size. Rather than writing a segment of code that performs the MAC and the scaling with all the attendant CPU overhead, the operation can be implemented directly in hardware and called in software as though it were a member of the native instruction set. Custom peripherals are larger, more complex pieces of hardware that can be abstracted into a single C-callable function. By using these hardware acceleration technologies, the timing control of any design is simplified enormously.
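A plain-C software reference for the MAC-that-also-scales example might look like the following; in a NIOS design, this entire body would collapse into a single custom instruction attached to the ALU and invoked like a native operation. The function name and the specific scaling rule are illustrative assumptions:

```c
#include <stdint.h>

/* Software reference for a MAC that also scales its result based on
   size: accumulate the product, then repeatedly halve the running sum
   until it fits back into 16-bit range. A custom instruction would
   perform all of this in one hardware operation. */
int32_t mac_scale(int32_t acc, int16_t a, int16_t b)
{
    int64_t sum = (int64_t)acc + (int32_t)a * b; /* multiply-accumulate */
    while (sum > INT16_MAX || sum < INT16_MIN)
        sum /= 2;                                /* scale down to range */
    return (int32_t)sum;
}
```

In the software-only version, every call pays the CPU overhead of the multiply, the comparisons, and the scaling loop; folding the sequence into hardware removes that overhead from the instruction stream.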
The combination of NIOS with DSP Builder can yield great gains in terms of performance as well as cost of goods. A single PLD can replace several DSPs plus their support hardware, depending on the calculation and control tasks in the system.
(Continued in Part 2.)