C-Language techniques for FPGA acceleration of embedded software

David Pellerin (ImpulseC) and Kunal Shenoy (Xilinx)

3/31/2006 2:19 PM EST

Developers of embedded and high-performance systems are taking increased advantage of FPGAs for hardware-accelerated computing. FPGA computing platforms effectively bridge the gap between software programmable systems based on traditional microprocessors and systems based on custom hardware functions. Advances in design tools have made it easier to create hardware-accelerated applications directly from C language representations, but it is important to understand how to use these tools to the best advantage, and how decisions made during the design and programming of mixed hardware/software systems will impact overall performance.

This paper presents a brief overview of modern FPGA-based platforms and related software-to-hardware tools, then moves quickly into a set of examples showing how computationally-intensive algorithms can be written, analyzed and optimized for increased performance.

Overview
In recent years, FPGA-based programmable platforms have emerged as viable alternatives for many types of high-performance computing applications. The opportunities presented by these platforms include the rapid creation of custom hardware, simplified field updates and the reduction or elimination of custom chips from many categories of electronic products. As FPGAs have grown in logic capacity, their ability to host high-performance software algorithms and complete applications has grown correspondingly.

FPGA-based platforms range from individual FPGAs, with or without embedded soft/hard processor cores, to higher-performance FPGA-based computing platforms. The recent explosion in the use of FPGA embedded processors has proven that FPGAs can provide a flexible, powerful hardware platform for complete "systems-on-programmable-chips". FPGA vendors now provide, at little or no cost, all the processor and peripheral components needed to assemble a highly capable, single-chip computing platform. In addition to processors and common processor peripherals, such a platform can include one or more customized, highly parallel software/hardware accelerators.

The increased use of these FPGA-embedded soft or hard processors is particularly noteworthy. These processors can be useful for a variety of reasons: they can run legacy code, including code that is planned for later acceleration in the FPGA fabric; they can be used during development as software test generators. They can also be used to replace custom hardware structures for such things as embedded state machines, and for standardized I/O. And they can host complete operating systems and perform non-critical computations that would be too space-intensive when implemented in hardware. When arranged as a grid, multiple soft processors can even form a parallel computing platform in and of themselves – one that is more generally programmable than an equivalent platform constructed entirely of low-level FPGA gates.

One example of such a platform is the Xilinx Virtex-4 FX device illustrated in Fig 1. The simplest of these devices (the FX-12) includes more than 12,000 programmable logic cells; an integrated PowerPC 405 core, which can operate at speeds as fast as 450 MHz; and dual 10/100/1000 Ethernet MACs. Larger versions of this same device provide substantially more logic cells (as many as 142,000) and up to four Ethernet MACs, as well as dual PowerPC processors and a larger number of dedicated multiplier units.


Fig 1. The Virtex-4 FX-12 combines general-purpose programmable fabric with an embedded PowerPC 405 processor and a high-performance auxiliary processing unit (APU) interface, along with dual 10/100/1G EMACs. (Figure Courtesy Xilinx, Inc.)

The Virtex-4 FX family of devices provides an ideal platform for hardware acceleration of embedded applications due to its close coupling of the processor to the FPGA fabric. The CPU is directly coupled to the Auxiliary Processing Unit (APU) controller, which provides direct access to hardware accelerators implemented in the FPGA logic. The APU controller provides a high-bandwidth interface between the FPGA fabric and the pipeline of the on-chip PowerPC. Fabric co-processor modules (FCMs) implemented in the FPGA fabric can be connected to the embedded PowerPC processor through the APU interface, allowing the use of custom hardware accelerators. When combined with C-to-hardware compiler tools, the APU controller allows software programmers to create hardware-accelerated software applications with little or no FPGA design expertise.

Used in this way, FPGAs are excellent platforms for implementing coarse-grained heterogeneous parallelism. Compared to other models of machine parallelism, this approach requires less process-to-process communication overhead; if each process maintains its own local memory and has a clearly delineated task to perform, the application can easily be partitioned between different areas of the FPGA, perhaps including different clock domains, and between independent FPGA devices. There are many types of calculations that lend themselves quite naturally to coarse-grained parallelism, including vector/array processing, pipelined image processing, and multistage signal filtering.

The role of software-to-hardware tools
Software development tools, whether intended for deeply embedded systems or for enterprise applications, improve the application development process in two fundamental ways. First, a good set of tools provides an appropriate and easily understood abstraction of a target platform, whether that platform is an embedded processor, a desktop PC, or a supercomputer. A good abstraction of the platform allows software developers to create, test, and debug relatively portable applications while encouraging them to use programming methods that will result in the highest practical performance on the target platform.

The second fundamental value that tools provide is in the mechanical process of converting an application from its original high-level description, whether written in C or Java, as a dataflow diagram or in some other representation, into an optimized low-level equivalent that can be implemented (that is, loaded and executed) on the target platform. Again, such a target platform might be a single-chip embedded system or it might be a large, general-purpose computing device.

In an ideal tool flow, the specific steps of this process would be of no concern to the programmer; the application would simply operate at its highest possible efficiency through the magic of automated tools. In practice this is rarely the case: any programmer seeking high performance must have at least a rudimentary understanding of how the optimization and code generation or mapping process works, and must exert some level of control over the process either by adjusting the flow (specifying compiler options, for example) or by revisiting the original application and optimizing at the algorithm level, or both.

To fulfill the dual role of tools as described above, tools for automated hardware generation must focus both on the automatic compilation/optimization problem and on delivering programming abstractions, or programming models, that make sense for the FPGA-based programmable platforms. To be effective, any such tool must provide a software-oriented design experience. Software-oriented programming, simulation and debugging tools that provide appropriate abstractions of FPGA-based programmable platforms allow software and system designers to begin application development, experiment with alternative algorithms and make critical design decisions without the need for specific hardware knowledge. This is of particular importance during prototype development. It is important to realize, however, that the use of software-to-hardware tools will not necessarily eliminate the need for hardware engineering skills; in fact, it is highly unlikely that a complete and well-optimized hardware/software application can be created using only software knowledge. On the plus side, it is certainly true that working prototypes can be more quickly generated using hardware and software design skills in combination with modern tools for software-to-hardware compilation.

The FPGA as an embedded software platform
Because of their reprogrammability, designing for FPGAs is conceptually similar to designing for common embedded processors. Similar tools can be used to verify the functionality of an application prior to actually programming a physical device, and there are tools readily available from FPGA vendors for performing in-system debugging.

Although the tools are more complex and design processing times are substantially longer (it can take literally hours to process a large application through the FPGA place-and-route process), the basic design flow can be viewed as one of software, rather than hardware development. As any experienced FPGA application designer will tell you, however, the skills required to make the most efficient use of FPGAs, with all their low-level peculiarities and vendor-specific architectural features, are quite specialized and often daunting. To put this in proper perspective, however, it's important to keep in mind that software development for specialized embedded processors such as DSPs can also require specialized knowledge. DSP programmers, in fact, often resort to assembly language in order to obtain the highest possible performance, and use C programming only in the application prototyping phase. The trend for both FPGA and processor application design has been to allow engineers to more quickly implement applications without the need to understand all the intricate details of the target, while at the same time providing access (through custom instructions, built-in functions/macros, assembly languages and hardware description languages as appropriate) to low-level features for the purpose of extracting the maximum possible performance.

One of the key attributes of such a software-oriented system design flow is the ability to implement a design specification captured in software using the most appropriate computing resources. If the most appropriate resource is a microprocessor, then this should be a simple matter of cross-compiling to that particular processor. If, however, the best fitting resource is an FPGA, then traditional flows would require a complete rewrite of the design into register transfer level (RTL) hardware description language. This is not only time consuming, but also error prone and represents a significant barrier to the designer in exploring the entire hardware/software solution space. With a software-oriented flow, the design can simply be modified in its original language, no matter which resource is targeted.

C language for FPGA design
Experimenting with mixed hardware/software solutions can be a time-consuming process due to the historical disconnect between software development methods and the lower-level methods required for hardware design, including design for FPGAs. For many applications, the complete hardware/software design is represented by a collection of software and hardware source files that are not easily compiled, simulated or debugged with a single tool set. In addition, because the hardware design process is relatively inefficient, hardware and software design cycles may be out of sync, requiring system interfaces, fundamental software/hardware partitioning decisions and algorithm designs to be prematurely locked down.

With the advent of C-based FPGA design tools, however, it is now possible to use familiar software design tools and standard C language for a much larger percentage of a given application, and in particular those parts of the design that are computationally-intensive. Later performance tweaks may introduce hand-crafted HDL code as a replacement for the automatically-generated hardware. Because the design can be compiled directly from C code to an initial FPGA implementation, however, the point at which a hardware engineer needs to be brought in to make such performance tweaks is pushed farther back in the design cycle and the system as a whole can be designed using more productive software design methods.

Emerging hardware compiler tools allow C-language applications to be processed and optimized to create hardware, in the form of FPGA netlists, and also include the necessary C language extensions to allow highly parallel, multiple-process applications to be described. For target platforms that include embedded processors, these tools can be used to generate the necessary hardware/software interfaces as well as generating low-level hardware descriptions for specific processes.

Making use of a programming model appropriate for highly parallel applications is also important. In many cases, this means re-thinking the application as a whole and finding new ways to express data movement and processing. The results of doing so, however, can be dramatic. By increasing application-level parallelism and taking advantage of programmable hardware resources, for example, it is possible to accelerate common algorithms by orders of magnitude over a software-only implementation.

Modern software-to-hardware tools such as Impulse C from Impulse Accelerated Technologies, Catapult C from Mentor, Mitrion C from Mitrion, and Handel-C from Celoxica support this type of application development by allowing application developers to describe their algorithms in more familiar, software-oriented environments. In the case of Impulse C, the algorithm of interest can be expressed as a standard C function (or a collection of such functions) and compiled automatically into HDL, which in turn is synthesized into the bitstream required to program the FPGA.

Programming for parallelism
For high performance software applications, the increased access to massively parallel hardware resources provided in an FPGA is a key benefit. It is not easy, however, for a software engineer to take advantage of these resources using standard programming languages. The standard C language, for example, has few if any features that are appropriate for parallel programming. Parallel processing and the programming of parallel systems require support for concurrency in the language being used, and an understanding of how to manage multiple, quasi-independent computational elements. The standard C language does not contain any such features. VHDL and Verilog, on the other hand, which are intended for describing highly parallel systems of connected hardware components, are designed for exactly this purpose, albeit at a rather low level of abstraction.

The closest thing to a truly parallel programming model in the context of C is support for multiple threads, which is not a standard feature of C but is popular and readily available in the form of add-on, operating system-specific libraries. Another, less common C library for this type of programming is the message-passing interface, or MPI. This library is intended for the design of larger supercomputing applications implemented on clusters of standard desktop computers and other, larger-scale parallel processing platforms.

The Impulse C programming model
A key aspect of any software-to-hardware design flow is the use of parallelism to increase performance. When accelerating C applications using FPGAs, parallelism can be exploited at two distinct levels: at the application system level and at the level of statements (or blocks of statements) within a specific subroutine or loop. Although there are ongoing attempts to create compiler technologies that can exploit both levels of parallelism with a high degree of automation, the best approach today is to focus automation efforts (represented by the software-to-hardware compiler) on the lower level aspects of the problem, while at the same time providing software programmers an appropriate and easy-to-use programming model that allows higher level, coarse-grained parallelism to be expressed. In this way programmers can make hardware/software partitioning decisions and experiment with alternative algorithmic approaches, leaving the task of low-level optimization to automated compiler tools. This approach is particularly useful for platforms such as the Virtex-4 device that include embedded processors. This is also the approach taken in the Impulse C tools provided by Impulse Accelerated Technologies.

At the heart of the Impulse C programming model are processes and streams (Fig 2). Processes are independently synchronized, concurrently operating portions of an application that are written in a standard language (in this case C language). Processes perform the work of the application by accepting data, performing computations, and generating relevant outputs.


Fig 2. The Impulse C programming model emphasizes the use of streams for inter-process communication, and also supports signals and shared memories.

Unlike traditional C subroutines, Impulse C processes are considered persistent; they are normally invoked once (whether in hardware or software) and continue as long as there is data available to be processed. The data processed by such an application flows from process to process by means of streams, or in some cases by means of messages or shared memories, which are also supported in the programming model.

In Impulse C, streams represent unidirectional communication channels that are used to connect multiple parallel processes, whether hardware or software. Each stream is defined by a data width (in bits, usually ranging from 8 to 128, depending on the application and the target platform), and a buffer depth, which is usually 1 or some other small number reflecting the depth of the generated stream buffers. These streams are read and written using the Impulse C functions co_stream_read and co_stream_write, which read and write packets of data from the stream in a synchronized way. If there is no data on an input stream, the co_stream_read function will block until data is available; if an output stream is already full, the co_stream_write function will block until a receiving process reads a packet of data, making space in the stream for additional data to be written. When implemented by the Impulse compiler as hardware, streams are generated as FIFOs and may have either one or two clocks depending on whether multiple processes are to be run at different clock rates. A stream buffer size of 1 indicates that the stream is essentially unbuffered; the receiving process will block until the sending process has completed and moved data onto the stream. In contrast, a larger buffer size will result in additional hardware resources (memories and corresponding control logic) being generated, but may result in more efficient process synchronization. As an application designer, you will choose buffer sizes that best meet the requirements of your particular application.
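
The blocking behavior described above can be explored on the desktop with a small bounded FIFO. The following is a minimal sketch in plain C and is not part of the Impulse C library: the names stream_t, stream_init, stream_try_write and stream_try_read are hypothetical, the packet width is fixed at 32 bits for simplicity, and a return value of 0 marks the point at which co_stream_write or co_stream_read would block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_MAX_DEPTH 16   /* upper bound chosen for this sketch only */

/* Hypothetical model of a unidirectional stream: a bounded FIFO of 32-bit packets. */
typedef struct {
    uint32_t buf[STREAM_MAX_DEPTH];
    size_t depth;   /* configured buffer depth, 1..STREAM_MAX_DEPTH */
    size_t head;    /* index of the oldest buffered packet */
    size_t count;   /* number of packets currently buffered */
} stream_t;

void stream_init(stream_t *s, size_t depth)
{
    assert(depth >= 1 && depth <= STREAM_MAX_DEPTH);
    s->depth = depth;
    s->head = 0;
    s->count = 0;
}

/* Returns 1 if the packet was queued, or 0 if the stream is full
   (the point where a blocking write would stall the producer). */
int stream_try_write(stream_t *s, uint32_t value)
{
    if (s->count == s->depth)
        return 0;
    s->buf[(s->head + s->count) % s->depth] = value;
    s->count++;
    return 1;
}

/* Returns 1 and stores the oldest packet in *value, or 0 if the stream
   is empty (the point where a blocking read would stall the consumer). */
int stream_try_read(stream_t *s, uint32_t *value)
{
    if (s->count == 0)
        return 0;
    *value = s->buf[s->head];
    s->head = (s->head + 1) % s->depth;
    s->count--;
    return 1;
}
```

With a depth of 1, a second write is refused until the reader has drained the first packet, so producer and consumer proceed in lockstep. A larger depth lets the producer run ahead of the consumer, at the cost of more buffer storage.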

An important role that streams play is in abstracting away the details of software/hardware communication for different types of platforms, and thereby making applications more portable. This is important because each target FPGA platform provides different methods for efficient software/hardware communication. In the Virtex-4 platform, for example, streams are generated as APU interfaces, allowing high-speed movement of data from the PowerPC to an FPGA accelerator, and back, with a small number of simple function calls.

The key to allocating processing power within such a system is to implement one or more processes in the FPGA to handle the heavy computation, and implement other processes on embedded or external microprocessors to handle file I/O, memory management, system setup, and other nonperformance-critical tasks. Using tools such as those included with Impulse C, an application comprising multiple parallel C processes can be modeled entirely in software, verified using a standard desktop C debugging environment, and then, after the application is functionally complete, incrementally moved into the FPGA for further optimization and acceleration.

Impulse C is designed primarily for streams-oriented applications, but is also flexible enough to support alternate programming models including the use of signals and shared memory as a method of communication between parallel, independently-synchronized processes.

The Impulse C library consists of minimal extensions to the C language (in the form of new data types and predefined function calls) that allow multiple, parallel program segments to be described, interconnected and synchronized. The Impulse C compiler translates and optimizes Impulse C programs into appropriate lower-level representations, including register transfer level (RTL) VHDL descriptions that can be synthesized to FPGAs, and standard C (with associated library calls) that can be compiled onto supported microprocessors through the use of widely available C cross-compilers.

From C to hardware – specific steps
The Impulse C tools give software programmers access to FPGAs by allowing hardware to be compiled directly from software descriptions. The resulting hardware may operate standalone (perhaps interfacing to other hardware elements via streaming, signal or memory interfaces) or may be attached to an embedded CPU and serve as hardware accelerators.

Because it is based on standard C, Impulse C allows FPGA algorithms to be developed and debugged using popular C and C++ development environments, including Microsoft Visual Studio and GCC-based tools. The compiler translates specific C-language subroutines to low-level FPGA hardware while optimizing the generated logic and identifying opportunities for parallelism. The compiler is also capable of unrolling loops and generating loop pipelines to exploit the extreme levels of parallelism possible in an FPGA. Instrumentation and monitoring functions generate debugging visualizations for highly parallel multi-process applications, helping system designers identify dataflow bottlenecks and other areas for acceleration.
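
To make the idea of loop unrolling concrete, the sketch below shows by hand what the transformation does to a simple multiply-accumulate loop. It is an illustration in plain C rather than output from any compiler; the function names, the 16-bit operand width and the unroll factor of 4 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rolled form: one multiply-accumulate per loop iteration.
   Assumes the running sum fits in 32 bits. */
int32_t dot_rolled(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by 4: the four products do not depend on one another,
   so hardware could compute them side by side. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0);   /* this sketch handles only multiples of 4 */
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        int32_t p0 = (int32_t)a[i]     * b[i];
        int32_t p1 = (int32_t)a[i + 1] * b[i + 1];
        int32_t p2 = (int32_t)a[i + 2] * b[i + 2];
        int32_t p3 = (int32_t)a[i + 3] * b[i + 3];
        acc += (p0 + p1) + (p2 + p3);   /* adder tree instead of a chain */
    }
    return acc;
}
```

On a processor the two versions do the same work. In an FPGA, the unrolled body exposes four independent multiplies that could be scheduled in the same clock cycle if enough multiplier units are available, which is the kind of statement-level parallelism the compiler looks for.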

For applications targeting the Xilinx Virtex-4 and other platforms involving embedded processors, the Impulse C compiler automates the creation of hardware/software interfaces and generates outputs compatible with FPGA platform building tools. This makes it possible to create high-performance, mixed hardware/software applications for FPGA-based platforms without the need to write low-level VHDL or Verilog.

The following summarizes the steps required for a typical accelerated Virtex-4 application using the Impulse and Xilinx tools:

  1. The application is initially written in standard C, using common C development tools. These tools include readily available tools such as Visual Studio, Eclipse, or GCC and GDB, and may also involve more comprehensive cross-development tools. During this phase, a baseline for validation (a software test bench, also written in C) is established, which simplifies the testing of later design iterations.
  2. A C profiler such as gprof may be invoked, or other, less sophisticated methods may be used to identify computational hotspots. Often these hotspots can be isolated to a few C subroutines or inner code loops requiring acceleration. Application monitoring (made possible by instrumenting the C code during software testing) can help characterize these hotspots and analyze data movement.
  3. Once identified, the critical hotspots are moved into dedicated functions (called processes) that are compatible with hardware compilation. Depending on the nature of the algorithm, some initial hand-optimization of the C code may be performed in this partitioning step, possibly including floating- to fixed-point conversions.
  4. Using software-to-hardware interface functions provided in the Impulse C library, data streams or shared memories are used to create abstract connections between the main algorithm running on the PowerPC and hardware-accelerated subroutines running in the FPGA. The modified software algorithm, which now includes one or more independently synchronized processes, is simulated again in a standard C environment to ensure its correct behavior.
  5. The C-language processes representing hardware accelerators are analyzed and optimized by the Impulse C compiler, resulting in hardware description files compatible with FPGA synthesis tools. Optimization reports generated in this phase help you understand the impact of various coding styles, and make appropriate revisions in the original C code for improved performance. During this compilation process, additional compiler outputs are generated that represent hardware-to-software interfaces, including (in the case of the Virtex-4 FPGA) the necessary APU interface logic. Software run-time libraries are also generated at this point, corresponding to the abstract stream and shared memory interfaces specified on the processor side of the application.
  6. Interactive optimization and cycle-accurate debugging tools may be invoked at this point to look for opportunities to improve the performance and utilization of the generated logic, or to debug hardware issues related to bit-accuracy and cycle-by-cycle behaviors.
  7. The generated hardware and software files are exported from the Impulse tools (as a PCORE peripheral) and imported directly into the Xilinx Platform Studio environment.
  8. The stream and shared memory interfaces defined in the C application are mapped to APU, PLB, or other interfaces where appropriate, along with other components (such as standard processor peripherals or non-standard IP blocks) to create the complete system. From within the Platform Studio interface, the entire application (both hardware and software) is built, resulting in a downloadable bitstream.
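
The validation baseline mentioned in step 1 can be as simple as a harness that runs the original routine and a restructured candidate over the same inputs and counts disagreements. The sketch below is a hypothetical example in plain C: the names kernel_fn, count_mismatches, kernel_reference and kernel_candidate are invented for illustration, and the multiply-by-5 kernel merely stands in for a real hotspot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signature shared by the baseline routine and any restructured version. */
typedef uint32_t (*kernel_fn)(uint32_t x);

/* Runs both versions over the same inputs and returns the number of
   outputs that differ (0 means the candidate matches the baseline). */
size_t count_mismatches(kernel_fn reference, kernel_fn candidate,
                        const uint32_t *inputs, size_t n)
{
    assert(inputs != NULL || n == 0);
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        if (reference(inputs[i]) != candidate(inputs[i]))
            mismatches++;
    }
    return mismatches;
}

/* Stand-in for the original hotspot: multiply by a constant. */
uint32_t kernel_reference(uint32_t x)
{
    return x * 5u;
}

/* Restructured version using a shift and an add. Unsigned arithmetic
   wraps modulo 2^32 in both versions, so results agree for every input. */
uint32_t kernel_candidate(uint32_t x)
{
    return (x << 2) + x;
}
```

Keeping the harness independent of the kernel means the same test bench can be rerun after each later iteration, including the version that has been split into independently synchronized processes.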

A sample application: fractal image generation
The Mandelbrot image is a classic example of fractal geometry, a branch of mathematics that is widely used in the scientific and engineering community to simulate chaotic events such as weather. Fractals are also used to generate textures and imaging in video rendering applications. Mandelbrot image generation is an ideal candidate for demonstrating hardware acceleration because it includes a single, computation-intensive function that calculates a color for each pixel in the generated image. This calculation assigns the color of each pixel using an iterative computation, with the previous results used in the present calculation. The result is formed by repeatedly squaring a complex number, then adding another complex number. The complex number added to the equation is constant for a particular pixel and is changed by a fixed amount for each pixel. For each pixel, the following formula is applied repeatedly until a maximum iteration count is reached, or the value of $Z_N$ diverges towards infinity:

     $Z_{N+1} = Z_N^2 + c$

A pixel's color in the image depends on whether that pixel is in the Mandelbrot set and, if it is not, on the number of iterations needed to determine that its value diverges. Specifying a larger maximum iteration count provides a better quality image but also increases the associated computation time.

This inner loop is performed once for each pixel in the generated image, and represents a highly iterative calculation. The nature of the calculation is such that, for a given pixel location and corresponding values, the loop may require many thousands of iterations before a result is generated. Making this critical function faster by moving it into hardware significantly increases the speed of the whole system.
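
For reference, the per-pixel inner loop can be sketched in double-precision C as follows. This is a minimal illustration rather than the authors' code: the function name is hypothetical, and the starting value of zero and the escape test (magnitude greater than 2, checked as squared magnitude greater than 4) are the conventional choices for the Mandelbrot set, not details taken from the application described here.

```c
#include <assert.h>

/* Returns the number of iterations performed before the magnitude of Z
   exceeds 2, or max_iter if it never does within the iteration budget. */
int mandelbrot_iterations(double c_re, double c_im, int max_iter)
{
    assert(max_iter >= 0);
    double z_re = 0.0, z_im = 0.0;   /* Z starts at zero */
    int n = 0;
    while (n < max_iter && z_re * z_re + z_im * z_im <= 4.0) {
        /* Z = Z^2 + c, expanded into real and imaginary parts */
        double next_re = z_re * z_re - z_im * z_im + c_re;
        z_im = 2.0 * z_re * z_im + c_im;
        z_re = next_re;
        n++;
    }
    return n;
}
```

The returned count is what a renderer would map to a color. Each iteration depends on the result of the previous one, so a single pixel cannot be parallelized internally; the available parallelism lies in pipelining the arithmetic and in processing different pixels independently.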

To make this application compatible with the Impulse software-to-hardware tools, and to partition the entire application between hardware and software, the following changes were made:

  1. The software project was partitioned into two distinct processes, one of which would run on the embedded PowerPC processor and control the display of the generated image, and the other of which would implement the accelerated image generation algorithm.
  2. The image generation algorithm was converted from double-precision floating point to a fixed-point implementation, using special macros provided in the Impulse C library.
  3. Data streams were described, again using Impulse C function calls, to manage software-to-hardware and hardware-to-software communication.
  4. The software portion of the algorithm was modified such that the original algorithm could be executed on the embedded PowerPC as well as in hardware, allowing direct comparison of the results, in terms of image correctness and processing speed.
  5. The Xilinx Virtex-4 was specified as the target device, allowing the Impulse tools to generate the required low-level APU software/hardware interfaces.

The result of this experiment is illustrated in Fig 3. This relatively simple experiment shows that, with minimal effort, a computationally-intensive algorithm can be moved into an FPGA hardware accelerator for an immediate 17X increase in system performance over the PowerPC processor, which in this test was running at 300 MHz.


Fig 3. Screen shot of fractal image generation test (output generated from Xilinx ML403 development board).

While a 17X increase in performance is impressive, it should be noted here that much greater levels of acceleration are possible for this algorithm (and others like it) simply by applying additional hardware accelerators. Because this algorithm is highly scalable (the image could be partitioned into discrete segments before processing), there is no reason why multiple pixel generators could not be created, up to the limit of the target FPGA and with correspondingly higher effective performance.
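
One way to partition the image into discrete segments is to give each pixel generator a contiguous band of rows. The sketch below is a hypothetical helper in plain C (the names row_band_t and band_for_generator are invented for illustration) that divides the rows as evenly as possible; it says nothing about how the bands would be delivered to the hardware.

```c
#include <assert.h>

/* A contiguous band of image rows assigned to one pixel generator. */
typedef struct {
    int first_row;
    int row_count;
} row_band_t;

/* Splits `height` rows into `generators` contiguous bands whose sizes
   differ by at most one row, and returns the band for `index`. */
row_band_t band_for_generator(int height, int generators, int index)
{
    assert(height >= 0 && generators > 0);
    assert(index >= 0 && index < generators);
    int base = height / generators;    /* rows every band receives */
    int extra = height % generators;   /* the first `extra` bands get one more */
    row_band_t band;
    band.first_row = index * base + (index < extra ? index : extra);
    band.row_count = base + (index < extra ? 1 : 0);
    return band;
}
```

For example, 10 rows split across 4 generators yields bands of 3, 3, 2 and 2 rows starting at rows 0, 3, 6 and 8. Because pixels that stay bounded take the full iteration count, equal-sized bands do not guarantee equal work, so the actual gain from adding generators depends on the region being rendered.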

Summary
This paper has described the fundamentals of FPGA-based platforms, and how C-language programming techniques can be applied effectively to these highly parallel platforms. The roles of software and hardware compiler tools in implementing this model have also been discussed, and we have seen a specific example of how a larger parallel application can be expressed using the Impulse C libraries. We have also seen how desktop simulation and application monitoring can be used to debug an application and have discussed the process of FPGA hardware generation. Additional information and examples can be found in the book Practical FPGA Programming in C, available from Prentice Hall.

References
Pellerin, David and Scott Thibault. Practical FPGA Programming in C. Prentice Hall, 2005.

Shenoy, Kunal. Accelerating Software Applications Using the APU Controller and C-to-HDL Tools. Xilinx Application Note XAPP901, Xilinx, Inc., 2005.

Shenoy, Kunal. Implementing a Virtex-4 FX PowerPC System with a C-to-HDL Hardware Coprocessor Accelerator. Xilinx, Inc., 2005.

Virtex-4 User Guide. Xilinx, Inc., 2005.

David Pellerin is CTO of Impulse Accelerated Technologies, Kirkland, WA. David can be reached at david.pellerin@ImpulseC.com.

Kunal Shenoy is a Design Engineer at Xilinx, San Jose, CA. Kunal can be reached at kunal.shenoy@Xilinx.com.

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.

