Design Article
How to accelerate algorithms by automatically generating FPGA coprocessors
Glenn Steiner, Kunal Shenoy, Dan Isaacs (Xilinx), and David Pellerin (ImpulseC)
8/9/2006 7:33 PM EDT
In this article, we explore code acceleration and techniques for code conversion to hardware coprocessors. We also demonstrate the process for making trade-off decisions with benchmark data through an actual image-rendering case study involving an auxiliary processor unit (APU)-based technique. The design uses an immersed PowerPC implemented in a platform FPGA.
The value of a coprocessor
A coprocessor is a processing element used alongside a primary processing unit to offload computations that the primary unit would otherwise perform. Typically, the coprocessor function implemented in hardware replaces several software instructions. Code acceleration is thus achieved both by reducing multiple instructions to a single instruction and by implementing that instruction directly in hardware.
The most frequently used coprocessor is the floating-point unit (FPU), the only common coprocessor that is tightly coupled to the CPU. There are no general-purpose libraries of coprocessors. Even if there were, it would still be difficult to couple a coprocessor to a CPU such as a Pentium 4.
As shown in Fig 1, the Xilinx Virtex-4 FX FPGA has one or two PowerPCs, each with an APU interface. By embedding a processor within an FPGA, you now have the opportunity to implement complete processing systems of your own design within a single chip.

1. Virtex-4 FX processor with APU interface and EMAC blocks.
The integrated PowerPC with APU interface enables a tightly coupled coprocessor that can be implemented within the FPGA. Frequency requirements and pin number limits make an external coprocessor less capable. Thus, you can now create application-specific coprocessors attached directly to the PowerPC, providing significant software acceleration. Because FPGAs are reprogrammable, you can rapidly develop and test CPU-attached coprocessor solutions.
Coprocessor connection models
Coprocessors are available in three basic forms: CPU bus connected, I/O connected, and instruction-pipeline connected. Mixed variants also exist.
- CPU Bus Connected: Processor bus-connected accelerators require the CPU to move data and send commands through a bus. Typically, a single data transaction can require many processor cycles. Data transactions can be hindered by bus arbitration and the necessity for the bus to be clocked at a fraction of the processor clock speed. A bus-connected accelerator can include a direct memory access (DMA) engine. At the cost of additional logic, the DMA engine allows a coprocessor to operate on blocks of data located on bus-connected memory, independent of the CPU.
- I/O Connection: I/O-connected accelerators are attached directly to a dedicated I/O port. Data and control are typically provided through GET or PUT functions. Because they need no arbitration, have simpler control logic, and serve fewer attached devices, these interfaces are typically clocked faster than a processor bus. A good example of such an interface is the Xilinx Fast Simplex Link (FSL). The FSL is a simple FIFO interface that can be attached to either the Xilinx MicroBlaze soft-core processor or a Virtex-4 FX PowerPC. Data movement through the FSL has lower latency and a higher data rate than data movement through a processor bus interface.
- Instruction Pipeline Connection: Instruction-pipeline-connected accelerators attach directly to the computing core of a CPU. Because the accelerator is coupled to the instruction pipeline, instructions not recognized by the CPU can be executed by the coprocessor. Operands, results, and status are passed directly to and from the data execution pipeline. A single operation can result in two operands being processed, with both a result and status being returned.
Because the interface is directly connected, instruction-pipeline-connected accelerators can be clocked faster than a processor bus. The Xilinx implementation of this connection model through the APU interface demonstrates a 10x clock cycle reduction in the control and movement of data for a typical double-operand instruction. The APU controller is also connected to the data-cache controller and can perform data load/store operations through it. Thus, the APU interface is capable of moving hundreds of millions of bytes per second, approaching DMA speeds.
Either I/O-connected accelerators or instruction-pipeline-connected accelerators can be combined with bus-connected accelerators. At the cost of additional logic, you can create an accelerator that receives commands and returns status through a fast, low-latency interface while operating on blocks of data located in bus-connected memory.
The C-to-HDL tool set described in this article is capable of implementing bus-connected and I/O-connected accelerators. It is also capable of implementing an accelerator connected to the APU interface of the PowerPC. Although the APU connection is instruction-pipeline-based, the C-to-HDL tool set implements an I/O pipeline interface with a resulting behavior more typical of an I/O-connected accelerator.
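To make the GET/PUT style of an I/O-connected accelerator concrete, here is a minimal software model of a one-directional word FIFO of the kind the FSL provides. It is only a sketch: the names fifo_t, fifo_put, and fifo_get and the depth of 16 words are assumptions made for illustration, not the FSL hardware interface or any Xilinx API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* hypothetical depth, chosen for illustration */

/* Software model of a one-directional word FIFO between CPU and accelerator. */
typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head;   /* index of the next word to read */
    size_t count;  /* number of words currently stored */
} fifo_t;

void fifo_init(fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* PUT: returns false (and stores nothing) when the FIFO is full. */
bool fifo_put(fifo_t *f, uint32_t word)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH) {
        return false;
    }
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return true;
}

/* GET: returns false (and leaves *word untouched) when the FIFO is empty. */
bool fifo_get(fifo_t *f, uint32_t *word)
{
    assert(f != NULL && word != NULL);
    if (f->count == 0) {
        return false;
    }
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In this model the CPU side would call fifo_put to hand operands to the accelerator and fifo_get on a second FIFO to collect results. No bus arbitration appears anywhere in the exchange, which is the property that lets such links run with low latency.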
FPGA / PowerPC / APU interface
FPGAs allow hardware designers to implement a complete computing system with processor, decode logic, peripherals, and coprocessors all on one chip. An FPGA can contain a few thousand to hundreds of thousands of logic cells. A processor can be implemented from the logic cells, as in the Xilinx PicoBlaze or MicroBlaze processors, or it can be one or more hard logic elements, as in the Virtex-4 FX PowerPC. The high number of logic cells enables you to implement data-processing elements that work with the processor system and are controlled or monitored by the processor.
FPGAs, being reprogrammable elements, allow you to program parts and test them at any stage during the design process. If you find a design flaw, you can immediately reprogram a part. FPGAs also allow you to implement hardware computing functions that were previously cost-prohibitive. The tight coupling of a CPU pipeline to FPGA logic, as in the Virtex-4 FX PowerPC, enables you to create high-performance software accelerators.
Fig 2 shows a block diagram of the PowerPC, the integrated APU controller, and an attached coprocessor. Instructions from cache or memory are simultaneously presented to the CPU decoder and the APU controller. If the CPU recognizes the instruction, it is executed. If not, the APU controller or the user-created coprocessor has the opportunity to acknowledge the instruction and execute it. Optionally, one or two operands can be passed to the coprocessor and a result or status can be returned. The APU interface also supports the ability to transfer a data element with a single instruction. The data element ranges in size from one byte to four 32-bit words.

2. PowerPC, integrated APU controller, and coprocessor.
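The decode hand-off described above can be pictured with a toy software model. This is a sketch under stated assumptions: the opcodes, the function names, and the stand-in custom operation are all hypothetical and bear no relation to real PowerPC encodings. The model also checks the CPU first and the coprocessor second, whereas the hardware presents the instruction to both at the same time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy model; these are not PowerPC encodings. */
enum { OP_ADD = 1, OP_SUB = 2, OP_USER_MAC = 100 };

typedef enum { EXEC_CPU, EXEC_COPROCESSOR, EXEC_ILLEGAL } executor_t;

/* CPU decoder: returns true and writes the result if it recognizes the opcode. */
bool cpu_execute(int opcode, int32_t a, int32_t b, int32_t *result)
{
    switch (opcode) {
    case OP_ADD: *result = a + b; return true;
    case OP_SUB: *result = a - b; return true;
    default:     return false;
    }
}

/* Coprocessor: acknowledges only the user-defined opcode. */
bool coprocessor_execute(int opcode, int32_t a, int32_t b, int32_t *result)
{
    if (opcode != OP_USER_MAC) {
        return false;
    }
    *result = a * b + a;  /* arbitrary stand-in for a custom two-operand operation */
    return true;
}

/* Reports which unit executed the instruction; two operands in, one result out. */
executor_t dispatch(int opcode, int32_t a, int32_t b, int32_t *result)
{
    assert(result != NULL);
    if (cpu_execute(opcode, a, b, result)) {
        return EXEC_CPU;
    }
    if (coprocessor_execute(opcode, a, b, result)) {
        return EXEC_COPROCESSOR;
    }
    return EXEC_ILLEGAL;
}
```

From the software's point of view, the custom operation looks like any other instruction: two operands go in and a result comes back, with no bus transaction in between.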
One or more coprocessors can be attached to the APU interface through a fabric coprocessor bus (FCB). Coprocessors attached to the bus range from off-the-shelf cores, such as an FPU, to user-created coprocessors. A coprocessor can connect to the FCB for control and status operations and to a processor bus, enabling direct access to memory data blocks and DMA data passing. A simplified connection scheme, such as the FSL, can also be used between the FCB and coprocessor, enabling FIFO data and control communication at the cost of some performance.
To demonstrate the performance advantage of an instruction-pipeline-connected accelerator, we first implemented a design with a processor bus-connected FPU and then with an APU/FCB-connected FPU. Table 1 summarizes the performance for a finite impulse response (FIR) filter for each case.

Table 1. Non-accelerated vs. accelerated floating-point performance.
As is reflected by the table, an FPU connected to an instruction pipeline accelerates software floating-point operations by 30X, while the APU interface provides a nearly 4X improvement over a bus-connected FPU.
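For readers unfamiliar with the benchmark, the following is a minimal sketch of a direct-form FIR filter in single-precision C. It is not the code used to produce Table 1; the function name and signature are assumptions. It does show why the workload is dominated by floating-point multiply-accumulate operations, which is what an attached FPU speeds up.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k],
   treating samples before the start of x as zero. */
void fir_filter(const float *h, size_t ntaps,
                const float *x, float *y, size_t nsamples)
{
    assert(h != NULL && x != NULL && y != NULL);
    for (size_t n = 0; n < nsamples; n++) {
        float acc = 0.0f;
        /* One multiply and one add per tap: the floating-point hot spot. */
        for (size_t k = 0; k < ntaps && k <= n; k++) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
}
```

Each output sample costs one multiply-add per tap, so the time spent in floating-point arithmetic grows with the product of sample count and tap count.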
Converting C code to HDL
Converting C code to an HDL accelerator with a C-to-HDL tool is an efficient method for creating hardware coprocessors. The illustration in Fig 3 and the steps detailed below this figure summarize the C-to-HDL conversion process:

3. C-to-HDL design flow.
- Implement the application or algorithm using standard C tools. Develop a software test bench for baseline performance and correctness (host or desktop simulations). Use a profiler (such as gprof) to begin identifying critical functions.
- Determine if floating-to-fixed point conversion is appropriate. Use libraries or macros to aid in this conversion. Use a baseline test bench to analyze performance and accuracy. Use the profiler to reevaluate critical functions.
- Using a C-to-HDL tool, such as Impulse C, iterate on each of the critical functions to:
  - Partition the algorithm into parallel processes.
  - Create hardware/software process interfaces (streams, shared memories, signals).
  - Automatically optimize and parallelize the critical code sections (such as inner code loops).
  - Test and verify the resulting parallel algorithm using desktop simulation, cycle-accurate C simulation, and actual in-system testing.
- Using the C-to-HDL tool, convert the critical code segment to an HDL coprocessor.
- Attach the coprocessor to the APU interface for final testing.
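The floating-to-fixed point conversion step in the flow above can be sketched with a few small helpers. The Q16.16 format (16 integer bits, 16 fractional bits) and the fix16_* names are assumptions chosen for illustration; they are not part of any library named in this article, and the right format for a given design depends on the range and precision the algorithm needs.

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16 fixed point: an assumed format, chosen for illustration only. */
typedef int32_t fix16_t;
#define FIX16_FRAC_BITS 16
#define FIX16_ONE ((fix16_t)1 << FIX16_FRAC_BITS)

/* Convert from double, rounding to nearest. */
fix16_t fix16_from_double(double v)
{
    /* The caller must keep v inside the representable Q16.16 range. */
    assert(v > -32768.0 && v < 32768.0);
    return (fix16_t)(v * FIX16_ONE + (v >= 0.0 ? 0.5 : -0.5));
}

/* Convert back to double, for checking accuracy against the float baseline. */
double fix16_to_double(fix16_t v)
{
    return (double)v / FIX16_ONE;
}

/* Multiply two Q16.16 values; the product truncates toward zero. */
fix16_t fix16_mul(fix16_t a, fix16_t b)
{
    /* Widen to 64 bits so the intermediate product cannot overflow. */
    return (fix16_t)(((int64_t)a * b) / FIX16_ONE);
}
```

Running the same test bench against both the floating-point and fixed-point versions, as the flow suggests, shows whether the loss of precision is acceptable before any hardware is generated.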
Impulse: C-to-HDL tool
Impulse C, shown in Fig 4, enables embedded system designers to create highly parallel, FPGA-accelerated applications by using C-compatible library functions in combination with the Impulse CoDeveloper C-to-hardware compiler. Impulse C simplifies the design of mixed hardware/software applications through the use of well-defined data communication, message passing, and synchronization mechanisms. Impulse C provides automated optimization of C code (such as loop pipelining, unrolling, and operator scheduling) and interactive tools, allowing you to analyze cycle-by-cycle hardware behavior.

4. Impulse C.
Impulse C is designed for dataflow-oriented applications, but it is also flexible enough to support alternate programming models, including the use of shared memory. This is important because different FPGA-based applications have different performance and data requirements. In some applications, it makes more sense to move data between the embedded processor and the FPGA through block memory reads and writes; in other cases, a streaming communication channel might provide higher performance. The ability to quickly model, compile, and evaluate alternate algorithm approaches is an important part of achieving the best possible results for a given application.
To this end, the Impulse C library comprises minimal extensions to the C language in the form of new data types and predefined function calls. Using Impulse C function calls, you can define multiple, parallel program segments (called processes) and describe their interconnections using streams, signals, and other mechanisms. The Impulse C compiler translates and optimizes these C-language processes into either:
- Lower-level HDL that can be synthesized to FPGAs, or
- Standard C (with associated library calls) that can be compiled onto supported microprocessors through the use of widely available C cross-compilers.
The complete CoDeveloper development environment includes desktop simulation libraries compatible with standard C compilers and debuggers, including Microsoft Visual Studio and GCC/GDB. Using these libraries, Impulse C programmers are able to compile and execute their applications for algorithm verification and debugging purposes. C programmers are also able to examine parallel processes, analyze data movement, and resolve process-to-process communication problems using the CoDeveloper Application Monitor.
The output of an Impulse C application, when compiled, is a set of hardware and software source files that are ready for importing into FPGA synthesis tools. These files include:
- Automatically generated HDL files representing the compiled hardware process.
- Automatically generated HDL files representing the stream, signal, and memory components needed to connect hardware processes to a system bus.
- Automatically generated software components (including a run-time library) establishing the software side of any hardware/software stream connections.
- Additional files, including script files, for importing the generated application into the target FPGA place and route environment.
The result of this compilation process is a complete application, including the required hardware/software interfaces, ready for implementation on an FPGA-based programmable platform.
Design example
The Mandelbrot image shown in Fig 5, a classic example of fractal geometry, is widely used in the scientific and engineering communities to simulate chaotic events such as weather. Fractals are also used to generate textures and imaging in video-rendering applications. Mandelbrot images are described as self-similar; on magnifying a portion of the image, another image similar to the whole is obtained.

5. Mandelbrot image and code acceleration.
The Mandelbrot image is an ideal candidate for hardware/software co-design because it has a single computation-intensive function. Making this critical function faster by moving it to the hardware domain significantly increases the speed of the whole system. The Mandelbrot application also lends itself nicely to clear divisions between hardware and software processes, making it easy to implement using C-to-HDL tools.
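The computation-intensive function in question is the per-pixel escape-time iteration. The following is a minimal sketch of that kernel in plain C. It is not the code from the design example; the function name, the use of double precision, and the conventional escape radius of 2 are assumptions.

```c
#include <assert.h>

/* Counts iterations of z = z*z + c, starting from z = 0, until |z| exceeds 2
   or max_iter is reached. The count is what gets mapped to a pixel color. */
int mandelbrot_iterations(double c_re, double c_im, int max_iter)
{
    assert(max_iter >= 0);
    double z_re = 0.0, z_im = 0.0;
    int i = 0;
    /* Compare |z|^2 against 4 to avoid a square root in the inner loop. */
    while (i < max_iter && z_re * z_re + z_im * z_im <= 4.0) {
        double next_re = z_re * z_re - z_im * z_im + c_re;
        z_im = 2.0 * z_re * z_im + c_im;
        z_re = next_re;
        i++;
    }
    return i;
}
```

Every pixel is computed independently of its neighbors and the loop body is a handful of multiplies and adds, so the function has a narrow interface (coordinates in, iteration count out) that maps naturally onto a stream-connected hardware process.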
We used the CoDeveloper tool set as the C-to-HDL tool set for this design example. We modified a software-only Mandelbrot C program to make it compatible with the C-to-HDL tools. Our changes included division of the software project into distinct processes (independent units of sequential execution); conversion of function interfaces (hardware to software) into streams; and the addition of compiler directives to optimize the generated hardware. We subsequently used the CoDeveloper tool set to create the Pcore coprocessor that was imported into Xilinx Platform Studio (XPS). Using XPS, we attached the Pcore to the PowerPC APU controller interface and tested the system.
Xilinx Application Note XAPP901 provides a full description of the design along with design files for downloading. Meanwhile, User Guide UG096 provides a step-by-step tutorial in implementing the design example.
Performance improvement examples
We measured performance improvements for the Mandelbrot image texturing problem, an image filtering application, and triple DES encryption. The performance improvements, demonstrating acceleration ranging from 11X to 34X that of software, are documented in Table 2.

Table 2. Algorithm acceleration through coprocessor accelerators.
Conclusion
Constrained by power, space, and cost, you might need to make a non-ideal processor choice. Frequently, it is a choice where the processor is of lower performance than desired. When the software code does not run fast enough, a coprocessor code accelerator becomes an attractive solution. You can hand-craft an accelerator in HDL or use a C-to-HDL tool to automatically convert the C code to HDL.
Using a C-to-HDL tool such as Impulse C enables quick and easy accelerator generation. Virtex-4 FX FPGAs, with one or two embedded PowerPCs, enable tight coupling of the processor instruction pipeline to software accelerators. As demonstrated in this article, critical software routines can be accelerated from 10X to more than 30X, enabling a 300 MHz PowerPC to provide performance equaling or exceeding that of a high-performance multi-gigahertz processor. The above examples were generated in just a few days each, demonstrating the rapid design, implementation, and testing possible with a C-to-HDL flow.
Glenn Steiner is Sr. Engineering Manager, Advanced Products Division, Xilinx, Inc. Glenn can be reached at glenn.steiner@xilinx.com.
Kunal Shenoy is a Design Engineer, Advanced Products Division, Xilinx, Inc. Kunal can be reached at kunal.shenoy@xilinx.com.
Dan Isaacs is Director of Embedded Processing, Advanced Products Division, Xilinx, Inc. Dan can be reached at dan.isaacs@xilinx.com.
David Pellerin is Chief Technology Officer at Impulse Accelerated Technologies. David can be reached at david.pellerin@impulsec.com.
Editor's Note: This article first appeared in the Xilinx Embedded Magazine and is presented here with the kind permission of Xcell Publications.



