Increasing bandwidth in industrial applications with FPGA co-processors

Michael Parker, Altera Corp.

2/1/2010 10:00 PM EST

FPGAs have long been used as primary processors and co-processors in telecommunications. Digital signal processing (DSP) in industrial applications often differs fundamentally from the typical telecommunications application. In telecommunications, the input data commonly arrives at high data rates, with real-time constraints that require calculations to be completed between successive input data buffers or samples. With a DSP processor, this may allow for only a few tens of instructions per input data sample. This instruction bandwidth limitation can be eased by taking advantage of the multiple processing units in some DSP processors. However, creating the specialized pipelined code that takes true advantage of this parallelism requires hand optimization of assembly language routines. Implementing, maintaining, and reusing this type of code can be troublesome and expensive. Additionally, the degree of parallelism (simultaneous executions) is relatively low, and may still not allow the real-time processing constraints to be met.

A better alternative for high-bandwidth computations is to use an FPGA as a co-processor that integrates the repetitive, speed-critical portions of an algorithm into the FPGA. With an FPGA and automated design software, design engineers have the ability to optimize system performance in ways not possible with a traditional DSP. This article discusses the general issues of moving part, or all, of a DSP industrial application onto an FPGA using system software design tools.

Automated Software Design Suite

The design software referred to in this article consists of three main components: Quartus II, SOPC Builder, and DSP Builder development tools. Collectively, these tools comprise an automated system development platform that provides a high level of design integration and flexibility, allowing engineers to focus on their target design at the system level rather than at the level of HDL and logic construction.

These are the logic-level design tools that support embedded processor software development, DSP datapath design, synthesis, place-and-route, verification, and device programming. They perform the lower-level functions of producing a programmed FPGA from the set of design files passed to the development tools. These tools can be used separately or in conjunction, and can produce the equivalent of a design created with an HDL (Verilog or VHDL) methodology at a fraction of the effort. System designers do not need to be VHDL or Verilog programmers. The automated system generation tools allow the components of a hardware system to be defined, interconnected, simulated, and verified, all without resorting to the underlying HDL. With a true point-and-click design method, system architects can generate an entire system, simulate and verify it, and download it into an FPGA, all from the PC desktop.

Industrial Applications

In many industrial applications, such as ultrasonic flaw detection, the data rate from the sensors may reach as high as 50 MSPS. In other industrial applications, the sensor data rates can be much lower (e.g., 100 kSPS), but there may be multiple sensors. Either way, if the processing chain is complex, DSP processors often do not have the bandwidth necessary to meet real-time deadlines, and processing must be done offline.
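The real-time budget implied by these data rates can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 1 GHz clock and the count of 16 sensors used in the usage note are assumptions chosen for the example, not figures for any particular processor or system.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available to process one input sample in real time.
 * Both rates are in hertz; integer division rounds the budget down. */
uint64_t cycles_per_sample(uint64_t clock_hz, uint64_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;
}

/* Budget per sample when several sensors share one processor.
 * The aggregate sample rate is the per-sensor rate times the sensor count. */
uint64_t cycles_per_sample_multi(uint64_t clock_hz,
                                 uint64_t sample_rate_hz,
                                 uint64_t num_sensors)
{
    assert(num_sensors > 0);
    return cycles_per_sample(clock_hz, sample_rate_hz * num_sensors);
}
```

Under these assumptions, a 1 GHz processor fed by a single 50 MSPS sensor has 20 cycles per sample, while the same processor serving 16 sensors at 100 kSPS each has 625 cycles per sample. Any processing chain that needs more cycles than this per sample cannot keep up in real time on a serial machine.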

The difficulty with DSP processors in these applications comes in two parts. First, they are essentially serial machines, processing one element of the signal chain at a time. In some high-end DSP processors, specific instructions can process data simultaneously, giving a degree of parallelization. Often, the only access to these parallel instructions is by coding in assembly language or by using special compiler modes. Either way, this requires a high degree of expertise and makes the code unusable on other hardware architectures.

Second, in some applications multiple DSP processors can be used to obtain true parallel processing, but software complexity and hardware costs rise very quickly as the process is made parallel. Software for these systems becomes more complex and less reusable because of data dependencies and inter-processor communication schemes. Even then, the degree of true parallelism is low compared to a hardware-based solution, and it comes at a significant price in terms of design time, cost of goods, and time to market.

Gaining Speed with FPGAs as Parallel Co-Processors

These issues have led many industrial designers to use FPGAs, taking advantage of their ability to convert a cascaded set of operations into a parallel structure that operates in several clock cycles at 200+ MHz. This is the central advantage of FPGA technology: the ability to speed up an algorithm by making the process truly parallel.

In most DSP applications, engineers have few options to increase performance beyond optimizing with specialized assembly language instructions or upgrading the DSP processor. With the FPGA approach, the hardware and the software can be optimized together. Moreover, the designer can change the partitioning of the system, moving more processing into hardware to meet the system throughput requirements. As a result, the designer now has a three-dimensional optimization space available (see Figure 1), whereas code optimization and processor speed were previously the only choices. With the performance of many DSP processors flattening at clock rates of about 1 GHz, designers are typically left with only one option: code optimization. The broad flexibility provided by programmable solutions helps create systems that were previously impossible to design, either because of time and cost of goods or because traditional DSPs could not handle the calculation load. Now the engineer has one more degree of freedom in the design process: co-processor hardware acceleration. This acceleration takes parallelization one step further than DSP processors can, by partitioning the algorithm so that its computationally intensive portions run in parallel in hardware.

Figure 1. The 3D Optimization Space Opened by the Design Software

Increased Flexibility

In addition to the obvious calculation-speed increase, using FPGAs as co-processors gives the designer increased flexibility in four important ways:

  1. Speed gain in the core algorithm relaxes processing pressure on the remaining parts of the algorithm, reducing both the overall timing criticality of the system and the degree of optimization required. The scalability and throughput of the system can generally be enhanced as a result.
  2. High-performance DSP applications can quickly fill the resources of available DSP processors, leaving no room for expansion. With the FPGA approach, additional algorithms or filters can be added to the existing system, while leaving processor cycles free for more value-added features and capabilities.
  3. Control hardware, memory, and data converter interfaces that traditionally exist external to the DSP processor can be easily integrated into the FPGA and combined with the algorithm in a single chip. This can save time and money in PC board development, and the programmability of FPGAs makes the overall system highly adaptable.
  4. The use of an FPGA can allow for post-design updates to the hardware acceleration datapath architecture. This can be accomplished by remotely updating the FPGA configuration file, in the same manner that processor firmware updates are remotely downloaded to field-installed systems.

Figure 2 shows an example of a C code fragment for a convolution operation that might be used in a signal filter. This nested loop structure accounts for a significant portion of digital signal processing tasks.

With the Harvard architecture of DSP processors, the multiply-accumulate (MAC) can be performed efficiently in a single instruction cycle. Some DSPs have up to eight MAC units that can run simultaneously. However, the variables "nx" and "ny" can be quite large in some cases. Completing the inner loop could then require hundreds of cycles, and completing the outer loop thousands.
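For readers without access to the figure, a nested-loop convolution of this kind can be sketched as follows. This is not the original code fragment: the names "nx" and "ny" are taken from the text, but their roles here (input length and number of coefficients), the function name, and the data widths are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct-form convolution of nx input samples x[] with ny coefficients y[].
 * Writes the nx - ny + 1 fully overlapped output samples to r[].
 * The 32-bit accumulator is an assumption; long filters with large
 * coefficients may need a wider one. */
void convolve(const int16_t *x, size_t nx,
              const int16_t *y, size_t ny,
              int32_t *r)
{
    assert(ny > 0 && nx >= ny);
    for (size_t i = 0; i + ny <= nx; i++) {   /* outer loop: one output sample */
        int32_t acc = 0;
        for (size_t j = 0; j < ny; j++) {     /* inner loop: one MAC per tap */
            acc += (int32_t)x[i + ny - 1 - j] * y[j];
        }
        r[i] = acc;
    }
}
```

The inner loop executes ny multiply-accumulates for every output sample, so a processor with a single MAC unit spends roughly ny cycles per output before any loop or memory overhead is counted.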

Figure 2. C Code Segment Illustrating Repetitive Nature of Typical Signal Processing Task

Clock Cycle Usage

Figure 3 compares the clock cycle usage of a traditional DSP processor performing this convolution operation with that of the same calculation implemented on a Stratix series FPGA using DSP Builder. Imagine unrolling a single iteration of the outer for loop (represented by the colored arrows in Figure 3) and placing the resulting calculations in a serial DSP program. Plotted against the system clock (calculation time), they form the sequential line of arrows.

Figure 3. Comparison: Execution of a Convolution Algorithm in a DSP Processor and an FPGA

In some DSP processors, the software designer can write optimized assembly code to pipeline instructions and data to parallel logic units. This reduces the clock cycle usage (see Figure 3(b)). Speed gains of 5-10X can be obtained from this optimization, performed either by hand or sometimes by an optimizing C compiler. Designers should be aware, however, that although the C compiler can perform some of the parallelization, the highest speed gains come from optimized assembly code. If the algorithm is modified or other considerations change, the code may have to be rewritten. A distinct drawback of this scenario is that the coding time and maintenance cost of this type of software can be significant.
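The kind of restructuring involved can be illustrated in C rather than assembly. The sketch below unrolls the multiply-accumulate loop four ways with independent accumulators; the function name and the unroll factor of four are assumptions for illustration, and whether a given compiler actually maps the four statements onto parallel MAC units depends on the toolchain and the target processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of n samples with n coefficients, unrolled four ways.
 * The four accumulators do not depend on one another, so a processor
 * with several MAC units can, in principle, update them in parallel. */
int64_t mac_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t j = 0;

    for (; j + 4 <= n; j += 4) {              /* four MACs per iteration */
        acc0 += (int32_t)x[j]     * h[j];
        acc1 += (int32_t)x[j + 1] * h[j + 1];
        acc2 += (int32_t)x[j + 2] * h[j + 2];
        acc3 += (int32_t)x[j + 3] * h[j + 3];
    }
    for (; j < n; j++) {                      /* leftover taps when n % 4 != 0 */
        acc0 += (int32_t)x[j] * h[j];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

Even in this small example the cost to readability is visible: the unroll factor is tied to the hardware, and a remainder loop is needed for tap counts that are not a multiple of four.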

By implementing the algorithm on an FPGA with the DSP Builder software, the designer can construct the convolution filter so that it operates in 1-2 clock cycles, as shown in Figure 3(c). In many applications, speed is the primary consideration. In both systems, the initial clock cycles required to load the data (fill the taps) before the calculation can be made have been ignored.
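The difference between the serial, partially parallel, and fully parallel cases can be captured in a deliberately idealized model. The function below is a hypothetical sketch: it assumes one multiply-accumulate per unit per clock and, like the comparison above, ignores the cycles needed to fill the taps, as well as memory access and pipeline latency.

```c
#include <assert.h>

/* Idealized clock cycles needed to produce one filter output when
 * `taps` multiply-accumulates are shared among `mac_units` parallel units.
 * Rounds up, because a partially used cycle still costs a full cycle. */
unsigned cycles_per_output(unsigned taps, unsigned mac_units)
{
    assert(taps > 0 && mac_units > 0);
    return (taps + mac_units - 1) / mac_units;
}
```

For a 64-tap filter, this model gives 64 cycles on a single MAC unit, 8 cycles on eight MAC units, and 1 cycle when the FPGA provides one multiplier per tap. The model says nothing about clock frequency, which also differs between the two kinds of device.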

When designing with an FPGA, it is important to understand a key trade-off that requires a change of perspective from traditional DSP design. To gain more speed in a DSP system, the designer can make several modifications:

  • Write more optimized code (usually assembly)
  • Upgrade the DSP processor to a faster, more expensive model
  • Add more DSP processors

In an FPGA design, the trade-offs are different. If the designer needs more speed (shorter execution times), then more logic elements (LEs), multipliers, and memory blocks must be used to make the operation more parallel. Since each LE occupies space on the chip, the traditional way to express this trade-off is space versus speed: more speed requires more space on the chip. The primary system trade-off is now FPGA chip size (number of LEs) versus cost.

The advantages of the FPGA approach are four-fold:

  1. The speed gains over optimized assembly code in the DSP processor are often a factor of 100 or more. Large FPGAs can contain thousands of multipliers, memory blocks, I/Os, and associated programmable logic.
  2. The design software is a graphical tool that uses a drag-and-drop architecture. Thus, development time is significantly less than for writing pipelined DSP code.
  3. If the filter requires a hardware control interface to external hardware or data converters, it can be easily implemented directly on the same FPGA with the software, rather than by writing code to interact with the interrupts or buses on a DSP processor. In fact, in many systems the FPGA is already present to perform interface tasks.
  4. Using these modern FPGA tool flows, modifications to the design and architecture are fast and easy to implement. For those using a traditional HDL FPGA tool flow, parameterizable IP MegaCores for FFTs, NCOs, FEC, FIR, and CIC filtering are readily available for common DSP functions. Moreover, they are generally portable across all of a given FPGA vendor's families and device densities.

Design Decisions: FPGA Co-Processor or DSP Processor

The speed advantage of using FPGAs is evident from Figure 3. The process can be made parallel so that the ratio of computational throughput to the number of clock cycles used is quite high. In deciding whether to use FPGA co-processors, there are a number of factors the designer must consider.

First, the design should be segmented into the tasks that will be placed in the co-processor and those left in a DSP processor or other system microprocessor (the master processor). In some cases, even the separate master processor can be eliminated from the hardware design by using an embedded soft processor built from FPGA logic.

When faced with the segmentation task, the easiest approach is to divide the problem into two distinct but related components: 1) the computational algorithm and 2) the configuration and control of that algorithm. While these are interdependent, they can be separated easily with the use of a simple flow chart.

Figure 4 shows a simplified example illustrating the segmentation process. The flow chart describes the implementation of a simple finite impulse response (FIR) filter that processes a high data rate, real-time signal (e.g., low-pass or band-pass filtering of a noisy sensor signal) and displays a derived parameter (the signal power).

The upper block, labeled Filter Configuration, is where the tap coefficients are calculated based on the configuration of the system. The user chooses the filter response and cut-off frequencies by selecting a set of filter coefficients. The control processor can load the desired set of coefficients into the filter depending upon the configuration. The crosshatched arrow leading into the Filter Configuration block is the software control provided by the master processor.
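On the master-processor side, loading a selected coefficient set can be as simple as copying a precomputed table into the filter's coefficient registers. The sketch below is hypothetical throughout: the four-tap length, the two filter modes, the Q15 coefficient values, and the modeling of the registers as a plain array are all assumptions made for illustration, not details of any particular filter core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 4   /* hypothetical filter length */

/* Hypothetical user-selectable filter responses. */
enum filter_mode {
    FILTER_PASS_THROUGH = 0,
    FILTER_MOVING_AVERAGE = 1,
    FILTER_MODE_COUNT
};

/* Precomputed coefficient sets in Q15 fixed point (32767 is about 1.0). */
static const int16_t coeff_sets[FILTER_MODE_COUNT][NUM_TAPS] = {
    { 32767, 0, 0, 0 },            /* pass-through: unit impulse */
    { 8192, 8192, 8192, 8192 },    /* 4-point moving average: 0.25 per tap */
};

/* Copy the selected coefficient set into the filter's coefficient
 * registers, modeled here as an ordinary array of NUM_TAPS entries. */
void load_coefficients(enum filter_mode mode, volatile int16_t *coeff_regs)
{
    assert((unsigned)mode < (unsigned)FILTER_MODE_COUNT);
    for (size_t i = 0; i < NUM_TAPS; i++) {
        coeff_regs[i] = coeff_sets[mode][i];
    }
}
```

Because this step runs only when the user changes the configuration, it has no real-time pressure and is a natural fit for software control.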

Figure 4. Segmentation of a Real-Time FIR Filter Processor

In the second block, the system manages the data input, such as framing incoming data into manageable segments for the filter. Additionally, this data framing operation must send a few hardware signals to the ADC, such as enable and acknowledge (shown by the white arrows). In the Data Output block, hardware control is again required to format the output data and facilitate its final placement, for example a DMA transfer into external SRAM. In the final block, the power in the filtered signal is calculated by squaring the signal elements and summing them. The power parameter is then displayed.
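The power calculation in the final block can be sketched in a few lines. The function names and the choice of 16-bit samples are assumptions for illustration; the wide accumulator is one common way to keep the sum of squares from overflowing in integer arithmetic, not necessarily the approach used in the system described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of the squares of n 16-bit samples.
 * Each square is at most 2^30, so it fits in 32 bits; the 64-bit
 * accumulator leaves headroom for more than a billion samples. */
uint64_t signal_power_sum(const int16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = s[i];              /* widen before squaring */
        acc += (uint64_t)(v * v);
    }
    return acc;
}

/* Mean power per sample, rounded down to an integer. */
uint64_t signal_power_mean(const int16_t *s, size_t n)
{
    assert(n > 0);
    return signal_power_sum(s, n) / n;
}
```

Two full-scale negative samples already produce a sum of 2^31, which is larger than a signed 32-bit accumulator can hold. That is why the accumulator width, or an equivalent scaling step, has to be considered when this stage is implemented with integer math.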

Now let's segment the design into parts that will reside in the FPGA and parts that remain with the master processor. The grey circular arrow in the background represents the high-speed, repetitive portion of the algorithm. This is the segment of the algorithm that changes little in form throughout the system's usage. The convolution of the data can be performed by a FIR filter with adjustable coefficient sets based on simple selections by the user. It is also the most speed-critical portion of the algorithm. This makes it the best candidate for transfer to an FPGA, based on its speed requirements and its static dataflow configuration.

The final block has a dashed line diagonally across its face. This represents a possible division of the task into separate parts. Since we are simply displaying the power parameter, the timing pressure on this part of the algorithm is very low (on the order of tens of milliseconds). However, the calculation of the power can be computationally intensive because it involves squaring an array element by element and then summing the results. Whether this remains with the master processor or is placed in the FPGA depends on:

  1. What is the load on the master processor? Can moving task(s) to the FPGA free other master processor resources?
  2. How many FPGA logic elements does the filter operation take? How many LEs will the power calculation take?
  3. Is floating-point or integer arithmetic being used in the master processor? If integer math is used, then a scaling operation will be needed to prevent possible overflow caused by the squaring operation.

When segmenting the design, designers will typically find that there are obvious portions of the algorithm that should be moved to the FPGA, while others depend on a number of system issues. For more complex code, the best way to make this determination is to develop a model of the code in a higher-level language (such as C/C++ or MATLAB). The compiler's profiling functions can then be used to determine execution time and to find which portions of the algorithm use the most CPU resources.
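Where a profiler is not available, a rough first estimate can be obtained by timing each stage of the C model by hand. The sketch below swaps in the standard `clock()` function in place of a compiler's profiling tools; the `stage_fn` type, the `time_stage` helper, and the trivial example stage are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature for a processing stage under test (hypothetical). */
typedef void (*stage_fn)(void *ctx);

/* CPU time, in seconds, consumed by running `stage` `repeats` times.
 * Repeating the stage helps when a single run is shorter than the
 * resolution of clock(). */
double time_stage(stage_fn stage, void *ctx, unsigned repeats)
{
    assert(stage != NULL && repeats > 0);
    clock_t start = clock();
    for (unsigned i = 0; i < repeats; i++) {
        stage(ctx);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Trivial example stage: counts how many times it has been called. */
void count_calls_stage(void *ctx)
{
    (*(unsigned *)ctx)++;
}
```

In practice, each stage of the model (filtering, power calculation, and so on) would be wrapped in a function of this form and timed in turn. The stages that dominate the total are the first candidates for the FPGA.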

Conclusion

Like every design decision in a complex system, moving all or part of an algorithm into an FPGA co-processor depends on a wide variety of issues. Ultimately, it comes down to an analysis of the size and complexity of both the hardware and software components in the design. A simple algorithm running at fairly low throughput (e.g., audio processing) in a low-cost DSP processor is probably not a good candidate for porting to an FPGA, unless an FPGA is already in the datapath of the hardware system. However, if a system requires high performance and high throughput (e.g., HD video processing), and generally pushes a DSP processor beyond its limits, then an FPGA co-processor approach is definitely an alternative.

Design tools are major productivity enablers in system development. The wide range of automated tools makes the development of FPGA-based systems faster and easier. The use of FPGAs as co-processors in industrial applications provides considerable speed gains over high-end DSPs, while giving the system architect a much higher degree of control than is possible with DSPs.

About the Author

Michael Parker
Altera Corp.
As Senior DSP Technical Marketing Manager, Michael Parker is responsible for Altera's DSP-related IP, and is also involved in optimizing FPGA architecture planning for DSP applications. Mr. Parker joined Altera in January 2007, and has over 20 years of DSP wireless engineering design experience with Alvarion, Soma Networks, TCSI, Stanford Telecom, and several startup companies.
