Design Article

IMG1

FPGA-based hardware acceleration of C/C++ based applications - Part 2

Dan Poznanovic, SRC Computers

8/1/2007 1:15 PM EDT

Editor's Note: There are a lot of folks who are interested in accelerating their algorithms/programs written in C or C++. Many of these guys and gals are aware that FPGA-based accelerators are available, but they don't know how you actually make these little rascals perform their magic.

In order to address this, I contacted a number of the main players in this arena and asked each of them if they would be interested in penning an article that explained the process of:

  1. Writing a new application (or modifying a legacy application) in C or C++ in a form suitable for acceleration.
  2. Partitioning the application such that some portions will be compiled for use on the general-purpose processor and other portions will be implemented in the FPGA.
  3. Actually getting those portions of the application that are to be accelerated into the FPGA by one means or another).
  4. Interfacing the main application running on the general-purpose processor with those portions running on the FPGA.
  5. Analyzing/profiling (debugging?) the new version of the program, part of which is running on the FPGA.

Some of the companies I contacted declined because they were too busy (which is a good place for them to be), but three stepped up to the plate:
  – Part 1: DRC Computer Corporation
  – Part 2: SRC Computers (this article)
  – Part 3: Impulse Accelerated Technologies


Programming model for reconfigurable computing
Traditionally the programming model for Reconfigurable Computing (RC) has been one of hardware design. Given that the tools required for the underlying FPGA technology of RC are all logic design tools from the Electronic Design Automation (EDA) industry, there really has not been a programming environment recognizable to a software developer. The tools have supported Hardware Definition Languages (HDL) such as Verilog, VHDL, and Schematic Capture.

With the introduction of System-on-chip (SOC) technology and the complexity associated with hardware definition of such complexity, high level languages have begun to be available. Java and C-like languages are becoming more common for use in programming RC chips. This is a significant step forward, but continues to require quite a leap by application programmers.

The SRC programming model is the traditional software development model where C and Fortran are used to program a reconfigurable processor. For the microprocessor component of an RC system any language capable of linking with the run time libraries (written in C) can be compiled and run on the system.

The SRC Carte programming environment was created with the design assumption that application programmers would be writing and porting applications to the RC platform. Therefore the standard development strategies, of design, code in high level languages (HLLs), compile, debug via standard debugger, edit code, re-compile, and so on, until correct, are used to develop applications. Only when the application runs correctly in a microprocessor environment, is the application recompiled and targeted for the reconfigurable processor.

Compiling to hardware in a RC system requires two compilation steps that are quite foreign to programming for an instruction processor. The output of the HLL compiler must be a hardware definition language. In Carte, the output is either Verilog or Electronic Design Interchange Format (EDIF). EDIF files are the hardware definition object files that define the circuits that will be implemented in the RC chips. If Verilog is generated, then that HDL must be synthesized to EDIF using a Verilog compiler.

A final step, "place and route", takes the collection of EDIF files and creates the physical layout of the circuits on the RC chip. The output files for this process is a configuration bitstream which can be loaded into an FPGA to create the hardware representation of the algorithm being programming into the RC processor.

The Carte programming environment performs the compilation from C or FORTRAN to bitstream for the FPGA without programmer involvement. It further compiles the codes targeted to microprocessors into objects modules. The final step for Carte is the creation of a Unified Executable which incorporates the microprocessor object modules, the FPGA bitstreams, and all of the required run time libraries into a single Linux executable file. The Carte compilation process is presented in Fig 1 and Fig 2.


1. Carte programming environment.


2. Carte compilation.

Target hardware architecture
In order to better understand the process of developing or porting codes to an RC platform using Carte, a description of the target hardware architecture is presented. SRC's MAP is a reconfigurable FPGA-based processor that operates as a peer to attached microprocessors. The SRC-7 MAP is presented in Fig 3. Internally it is composed to two FPGAs that are used to contain the compiled C or FORTRAN targeted for MAP. A third FPGA provides control for starting, stopping, configuring logic, and performing address translation and protection for data movement. The user FPGAs are Altera Stratix II EP2S180 devices that are clocked at a fixed rate of 150 MHz.


3. MAP-H Module.

The on-board memory for MAP is composed of eight physical SRAM banks presented to the software as 16 banks (64 MBytes) capable of delivering or receiving 16 64-bit words of data on each clock cycle. Additionally 2 banks of SDRAM memory are available for DMA access. On chip memory is provided via BRAM.

Connections in and out of MAP are provided through the control chip as well as directly to the user logic chips. The control chip manages a pair of 7.2 GB/s DMA ports that connect to CPUs or other MAPs, Common Memories or HiBar Disks. Connections directly into user logic chips are provided through GPIOX ports providing 12 GB/s streaming data access to or from a variety of external data sources or targets. Figures 4 and 5 present a system architecture within which MAP exists. Bandwidths in, out, and through the components of the MAP are balanced to avoid bottlenecks. SRC-7 bandwidth numbers for the MAP board are shown in Fig 3, while Fig 4 and Fig 5 show system bandwidth.


4. SRC-7 MAPstation.


5. Hi-Bar based systems.

MAPs can connect directly to CPU motherboards or indirectly through switches as depicted in the figures. The SNAP provides the connection to a motherboard through connection into CPU memory via a pair of DIMM slots. Writing and porting applications
Developing an application using the Carte programming environment for systems with reconfigurable processors follows a process that is very similar to that used for microprocessors whether code is being written from scratch or being ported. The code is organized into subprograms; the subprograms are targeted for the specific processor type and compiled.

Debugging is accomplished using a standard microprocessor debugger, such as gdb. While the code is being developed, feedback from the Carte compilers and through use of profiling identifies the subprograms that should be optimized. Optimization typically involves identifying the processor type for the routines, potential restructuring to maximize parallel execution, and eventual compilation for both microprocessor and reconfigurable processor. In the following sections, this process will be investigated in greater detail through simple examples.

Getting Started: The Carte programming environment runs under the Linux OS using standard program development tools such as editors, Makefiles, microprocessor compilers, and the Carte C and FORTRAN compilers. An application is considered to be a collection of files that contain subprograms that will run either on microprocessors or RC processors such as SRC's MAP. The application can be structured to run in multiple parallel microprocessor and multiple RC processors through the use of threads.

Carte utilizes a simple Makefile to control the compilation of an application. All applications are composed of at least a main routine that runs on the µP and a routine that run on the RC processor. For simplicity these will be identified as µP routines and MAP routines. There can be multiple µP routines and multiple MAP routines of arbitrary complexity.

Carte provides three modes for compilation: debug, simulate, and hw. Compiling in debug mode allows a programmer to run totally within a µP environment using normal debugger tools and print statements anywhere within the application. Meanwhile, simulate mode allows compiling the application for clock accurate logic simulation. The hw compile mode compiles the application to a mix of µP instructions and a collection of configuration bit streams for the FPGAs within the system. Switching between the three compilation modes requires no changes to source code. Only specifying alternate Makefile targets is required:

make debug

make simulate

make hw

The result of the Carte compile in all cases is a Linux unified executable file that contains all of the code and bitstreams required to execute the application. Applications are able to link in specialized precompiled library routines, if desired, using the linker. Both µP routines and MAP routines are compiled into Linux object modules and are linked into an executable. Using standard module types allows creating libraries of µP and MAP routines if desired.

The bulk of the application development time is spent using debug mode compiles. When an application is running correctly, then and only then is the compilation for hardware performed. Compiling in simulation mode will not be discussed in any detail in this paper. It is intended to support hardware designers who develop portions of an application in Verilog or VHDL. It allows them to use clock cycle accurate simulation to debug their designs. When compiling for hardware, Carte invokes the FPGA tools for creating the circuit that has been generated by the Carte compiler. A set of FPGA configurations is generated and incorporated into the Linux executable. All of the hardware management needed to configure FPGA chips and pass control to and from MAP hardware is performed by routines automatically included in the application. A programmer never explicitly calls such management routines.

Partitioning CPU and RC Code: Fig 6 (main.c) and Fig 7 (MAP Routine poly.mc) show an application with a µP main and a MAP routine. This is obviously a very simple example, but will be used to illustrate the process of coding, and optimizing applications that run in a mix of µP and MAPs. In this example the main.c code allocates memory, sets up the data, allocates a MAP, calls the MAP routine, and displays the results. One can see that with the exception of allocating a MAP, this code is essentially identical to a µP code that is making a simple subroutine call. The one difference in coding is that a MAP routine always has as its last argument a MAP number. This permits the programmer to select from a group of MAPs a particular processor to be used. In this example where a single MAP is allocated, the MAP number is simple set to zero.

Though the MAP routine, poly(), will execute on MAP hardware the call is an ordinary subroutine call. A proxy subroutine residing on the CPU manages the actual transfer of control to MAP hardware. The proxy determines whether the MAP's FPGAs have already been configured. If the FPGAs must be configured with the bitstream, the proper library routines are invoked, the FPGAs configured, parameters transferred, and control passed to the MAP hardware. The programmer does not ever have to explicitly manage the FPGAs.

This MAP routine evaluates a polynomial using Horner's rule. The input array is dt_source and the results of the polynomial calculation are returned in array dt_res. The subroutine parameters are passed into the MAP very similarly to any subprogram. All addresses, such as the array addresses dt_source and dt_res are virtual addresses. In the example the main routine on the CPU allocates the arrays via a malloc() call and passes the addresses as arguments to the poly routine on the MAP. The MAP control logic provides address translation and limits enforcement.

The main computation takes place in a pipelined loop. In this loop, 20 double precision floating point operations are performed in each clock cycle. The compilation output for this code is shown in Fig 8 (debug mode compile output). This analysis shows the pipeline to fire every clock cycle. The pipeline depth is 270 clocks.

Data to feed the pipeline loop is obtained from On Board Memory (OBM). In this example the OBM arrays are allocated via the OBM_BANK_A and OBM_BANK_B statements. To achieve highest performance, SRAM bank conflicts are avoided by allocating the input and output arrays in separate banks. Data is brought into the MAP via DMA where the CPU array and the OBM array addresses, and the array length are provided. The results are sent back to the CPU through a DMA. The style of DMA presented in the example is a buffered DMA, where OBM is allocated as used to hold input and output data. A streaming DMA is also available where data is presented directly to the computation. Use of a streaming DMA provides an optimization opportunity, when the computation permits it, by eliminating the latency required to completely buffer either the input or output data through OBM. A significant performance gain can be seen using streaming DMAs.

Fig 9 (example application makefile) shows the Makefile for compiling the example Horner's rule polynomial evaluation. The compilation output for a debug mode compile is shown in Fig 10 (debug mode execution) and the subsequent execution of the debug mode executable follows the compile output in Fig 10. Observe that the Carte C compiler displays the pipeline statistics. Each loop will be analyzed and have performance information displayed.

Fig 11 (H/W mode compile) shows the hardware mode compilation with subsequent execution in Fig 12 (H/W mode execution). Observe that the Carte loop analysis is identical in both debug and hardware mode compiles. In the hardware mode compile the Carte generated EDIF is further processed to create the FPGA circuit, in this case for a Xilinx Virtex2 VP100 chip. The resource utilization for that chip is displayed as part of the compilation output. Optimizing and Scaling RC Applications: The greater the parallel execution, the greater the performance. Provided that data can be delivered effectively to computation, more parallelism is best. In an RC FPGA-based system, greater parallelism can be achieved using both fine grain and course grain parallelism. Pipelining is an example of fine grain parallelism, where many functional units are firing with each clock tick. Coarse grain parallelism in the form of multiple parallel code block and multiple RC processors also can lead to greater performance. How does one implement such fine and course grain parallelism within C and FORTRAN using the Carte programming environment?

Pipelining was discussed in the previous example. Using parallel code block will next be discussed. In the world of microprocessor, parallel code blocks normally require multi-processor systems and the associated overhead that comes with an SMP synchronization. In an RC processor, parallelism can be constructed within the processor with minimal overhead. Due to the sequential semantics of C, parallelism is expressed through pragmas; in Carte, pragmas similar to those used in OpenMP are used to declare parallel code blocks. Streams were mentioned earlier in relation to reducing the latency of DMA data movement. The same approach can be utilized in implementing producer-consumer loops. A data stream can connect the two pipelined loops resulting in an extended pipeline without forcing loop fusion optimization. To illustrate the power of this approach the polynomial evaluation code "poly" will be expanded to include a post processing loop that performs histogramming.

Fig 13 (poly.mc routine with histogram) shows the expanded poly subroutine that contains a consumer loop that creates a histogram from the produced values from the polynomial. This example also illustrates the allocation of multiple banks of on-chip memory.

Data is passed between the producer loop and the consumer loop through stream S0. An inter-parallel block stream functions as a FIFO with flow control to properly synchronize the connected loops. The effect of this connection is an extended pipeline that starts in the producer loop and ends in the consumer loop. The processing in the consumer loop occurs in parallel with the producer loop – Fig 14 [debug compile of consumer-producer loops in poly()].

Observe that the Carte compiler reports that the histogramming loop has been slowed due to loop carried variable dependency. This slowdown will result in the loop running 3× as long as a fully optimized pipelined loop. To deal with such slowdowns, library routines are available. In this case, a conditional accumulator routine is available. Using the routine, the histogram loop is simplified. Fig 15 (optimized histogramming loop) shows the modified loop.

The compiler analysis is shown in Fig 16 (debug mode compiler output for optimized histogrammer), indicating that the histogram loop is now running at full speed. The end result of the steps to add a parallel section connected via a stream, and to use an optimized accumulator is that the two loops function as if a single pipeline loop was created by fusing the two loops. More work is being accomplished in every clock cycle, with very straightforward code changes.

Streams also provide a very powerful mechanism for scaling. In the previous example, a consumer loop was added as a parallel code block within the same chip as the producer loop. The Carte implementation of streams allows a stream to connect parallel code blocks within a single chip, in separate but connected chips and in separate but connected RC processors. The produce and consumer loops remain unchanged, using the put_stream() and gets() functions to send and receive data. Only the stream definition changes.

Bridge streams connect two chips within an RC processor, and chain streams connect two RC processors. Fig 17 (poly.mc using bridge streams) illustrates the use of bridge port streams. In this example, the poly() routine executes in one chip and the histo() routine executes in a second chip. Very little change to the code is required to split the original routine composed of parallel code blocks connected via a stream, into two routines connected via a bridge port stream. The example could be easily converted to a pair of MAP routines running in separate MAPs and connected with chain port streams. In creating that example, only the stream definition changes.

Debugging: As was mentioned earlier, the process of debugging code targeted for a mixed microprocessor and RC processor system under Carte is identical to the normal program development process for microprocessor based applications.

The code can be compiled in debug mode, with the –g option specified to maintain the symbol table. The code can then be executed within a debugger such as ddt, gdb, or idb. The code intended to be turned to logic and execute within FPGAs is instead compiled for the µP and run on the CPU. Parallel code blocks are maintained, and multiple FPGAs or MAPs are also maintained as parallel executing threads. Data movement and control passing to and from MAPs is handled through the use of a MAP control chip emulator. This ensures that data movement is able to be debugged as completely as is the computational logic. The MAP control emulator is implemented as library routines that are automatically link and invoked. The emulator is also able to provide traces of data movement. Print statement may be included in any MAP routines and are executed in debug mode, but deleted when the code is compiled for hardware.

Though debug mode execution is slower that executing on RC hardware, the execution performance is good enough to debug a large application. Compile time is fast and therefore permits the compile, debug, modify, re-compile cycle of debugging to take place. A brief example debug session is shown in Fig 18 (example debug session). In the example, breakpoints are set within the CPU based main, as well as the MAP based poly() routine. Values are displayed and the code is started and stopped. The main point of the example is that a normal gdb (or idb) session can be held for debugging the application.

Conclusion The Carte programming environment was presented in this paper. A code example, with compilations, execution for debugging, use of a debugger, optimizing code, and compiling for MAP, an FPGA based reconfigurable processor was presented.

The examples of polynomial evaluation and histogramming are artificially simple, but they are intended to give a feel for programming a reconfigurable processor. In reality, the application that are capable of seeing performance advantage on a RC system can be thousands of lines of code, with hundreds of routines. It is important to realize that writing or porting such an application with the Carte environment is both practical and realistic.

After all: "It's just programming!"

Dan Poznanovic is Vice President for Software Development at SRC Computers, Inc. and is also current chair of the CoreLib working group of the OpenFPGA organization. He has 30 years experience in the computing industry in both applications and system development and is now focusing his attention on providing high level programming language support for reconfigurable computing systems. A member of the IEEE and ACM, Dan can be contacted at poz@srccomputers.com.


print

email

rss

Bookmark and Share

Joinpost comment




Please sign in to post comment

Navigate to related information

Product Parts Search

Enter part number or keyword
PartsSearch

FeedbackForm