Editor's Note: There are a lot of folks who are interested in accelerating their algorithms/programs written in C or C++. Many of these guys and gals are aware that FPGA-based accelerators are available, but they don't know how you actually make these little rascals perform their magic.
In order to address this, I contacted a number of the main players in this arena and asked each of them if they would be interested in penning an article that explained the process of:
- Writing a new application (or modifying a legacy application) in C or C++ in a form suitable for acceleration.
- Partitioning the application such that some portions will be compiled for use on the general-purpose processor and other portions will be implemented in the FPGA.
- Actually getting those portions of the application that are to be accelerated into the FPGA (by one means or another).
- Interfacing the main application running on the general-purpose processor with those portions running on the FPGA.
- Analyzing/profiling (debugging?) the new version of the program, part of which is running on the FPGA.
Some of the companies I contacted declined because they were too busy (which is a good place for them to be), but three stepped up to the plate:
– Part 1: DRC Computer Corporation
– Part 2: SRC Computers (this article)
– Part 3: Impulse Accelerated Technologies
Programming model for reconfigurable computing
Traditionally, the programming model for Reconfigurable Computing (RC) has been one of hardware design. Because the tools required for the underlying FPGA technology are all logic design tools from the Electronic Design Automation (EDA) industry, there really has not been a programming environment recognizable to a software developer. The tools have supported Hardware Description Languages (HDLs) such as Verilog and VHDL, as well as schematic capture.
With the introduction of System-on-Chip (SoC) technology, and the difficulty of describing designs of that complexity at the hardware level, high-level languages have begun to be available. Java and C-like languages are becoming more common for programming RC chips. This is a significant step forward, but it still requires quite a leap by application programmers.
The SRC programming model is the traditional software development model, in which C and Fortran are used to program a reconfigurable processor. For the microprocessor component of an RC system, any language capable of linking with the run-time libraries (written in C) can be compiled and run on the system.
The SRC Carte programming environment was created with the design assumption that application programmers would be writing and porting applications to the RC platform. Therefore, the standard development strategy is used: design, code in high-level languages (HLLs), compile, debug with a standard debugger, edit, and recompile until the application is correct. Only when the application runs correctly in a microprocessor environment is it recompiled and targeted for the reconfigurable processor.
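To make the "get it right on the microprocessor first" strategy concrete, here is a minimal sketch in plain C. The kernel `scale_add` and its self-check are hypothetical illustrations and are not taken from Carte or its libraries; the point is only that the routine is ordinary C that can be compiled and stepped through in a standard debugger before any retargeting is attempted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel (not from the article): out[i] = k * a[i] + b[i].
   Plain C, so it can be compiled and debugged on the microprocessor. */
void scale_add(const int64_t *a, const int64_t *b, int64_t *out,
               size_t n, int64_t k)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = k * a[i] + b[i];
    }
}

/* Runs the kernel on a small fixed input and compares the result
   against hand-computed values. Returns 1 on a match, 0 otherwise. */
int scale_add_self_check(void)
{
    const int64_t a[4] = {1, 2, 3, 4};
    const int64_t b[4] = {10, 20, 30, 40};
    const int64_t expected[4] = {13, 26, 39, 52};
    int64_t out[4] = {0, 0, 0, 0};

    scale_add(a, b, out, 4, 3);
    for (size_t i = 0; i < 4; i++) {
        if (out[i] != expected[i]) {
            return 0;
        }
    }
    return 1;
}
```

A check like `scale_add_self_check()` would typically be kept alongside the kernel so that the same known-good inputs can be reused once the routine is recompiled for the reconfigurable processor.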
Compiling to hardware in an RC system requires two compilation steps that are quite foreign to programming for an instruction processor. First, the output of the HLL compiler must be a hardware description rather than machine instructions. In Carte, the output is either Verilog or Electronic Design Interchange Format (EDIF). EDIF files are the hardware definition object files that define the circuits that will be implemented in the RC chips. If Verilog is generated, that HDL must then be synthesized to EDIF using a Verilog compiler.
A final step, "place and route", takes the collection of EDIF files and creates the physical layout of the circuits on the RC chip. The output of this process is a configuration bitstream, which can be loaded into an FPGA to create the hardware representation of the algorithm being programmed into the RC processor.
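The flow just described can be summarized as a small state model. This is purely an illustrative sketch: the type and function names are invented here and do not correspond to any Carte tool or file format API. It only encodes the ordering stated above (HLL source to Verilog or EDIF, Verilog to EDIF by synthesis, EDIF to bitstream by place and route).

```c
/* Illustrative model of the hardware compilation flow; all names are
   hypothetical and chosen only for this sketch. */
typedef enum {
    ART_HLL_SOURCE,  /* C or Fortran source */
    ART_VERILOG,     /* HDL emitted by the HLL compiler */
    ART_EDIF,        /* hardware definition object file */
    ART_BITSTREAM    /* FPGA configuration bitstream */
} artifact_t;

/* Returns the artifact produced by the next step in the flow.
   hll_emits_verilog selects which of the two HLL compiler outputs
   mentioned in the text is modeled. A bitstream is terminal. */
artifact_t next_artifact(artifact_t current, int hll_emits_verilog)
{
    switch (current) {
    case ART_HLL_SOURCE:
        return hll_emits_verilog ? ART_VERILOG : ART_EDIF;
    case ART_VERILOG:
        return ART_EDIF;       /* synthesis */
    case ART_EDIF:
        return ART_BITSTREAM;  /* place and route */
    default:
        return ART_BITSTREAM;
    }
}

/* Counts the tool steps needed to get from HLL source to a bitstream. */
int steps_to_bitstream(int hll_emits_verilog)
{
    int steps = 0;
    artifact_t a = ART_HLL_SOURCE;
    while (a != ART_BITSTREAM) {
        a = next_artifact(a, hll_emits_verilog);
        steps++;
    }
    return steps;
}
```

In this model the direct-to-EDIF path takes two steps and the Verilog path takes three, which is simply the extra synthesis pass mentioned in the text.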
The Carte programming environment performs the compilation from C or Fortran to an FPGA bitstream without programmer involvement. It also compiles the code targeted to microprocessors into object modules. The final step for Carte is the creation of a Unified Executable, which incorporates the microprocessor object modules, the FPGA bitstreams, and all of the required run-time libraries into a single Linux executable file. The Carte compilation process is presented in Fig 1 and Fig 2.
1. Carte programming environment.
2. Carte compilation.
Target hardware architecture
To better understand the process of developing or porting code to an RC platform using Carte, a description of the target hardware architecture is presented. SRC's MAP is a reconfigurable FPGA-based processor that operates as a peer to attached microprocessors. The SRC-7 MAP is presented in Fig 3. Internally it is composed of two FPGAs that hold the compiled C or Fortran code targeted for MAP. A third FPGA provides control for starting, stopping, and configuring logic, and performs address translation and protection for data movement. The user FPGAs are Altera Stratix II EP2S180 devices that are clocked at a fixed rate of 150 MHz.
3. MAP-H Module.
The on-board memory for MAP is composed of eight physical SRAM banks presented to the software as 16 banks (64 MBytes) capable of delivering or receiving sixteen 64-bit words of data on each clock cycle. Additionally, two banks of SDRAM memory are available for DMA access. On-chip memory is provided by block RAM (BRAM).
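As a rough back-of-the-envelope check on those figures, the sketch below works out the implied peak on-board memory bandwidth and the size of each logical bank. It rests on assumptions that the text does not state outright: that the memory is accessed at the 150 MHz user-logic clock, that the 64 MBytes is the total across all 16 logical banks, and that a MByte here means 2^20 bytes. The constant and function names are made up for this example.

```c
#include <stdint.h>

/* Figures quoted in the article; the names are hypothetical. */
enum { MAP_BANKS = 16, WORD_BYTES = 8 };         /* 16 banks of 64-bit words */
#define MAP_CLOCK_HZ  150000000ULL               /* assumed memory clock: 150 MHz */
#define MAP_OBM_BYTES (64ULL * 1024ULL * 1024ULL) /* assumed total: 64 MBytes */

/* Peak bytes per second if every bank moves one word on every clock. */
uint64_t obm_peak_bytes_per_second(void)
{
    return (uint64_t)MAP_BANKS * WORD_BYTES * MAP_CLOCK_HZ;
}

/* Capacity of one logical bank, assuming the total is split evenly. */
uint64_t obm_bytes_per_bank(void)
{
    return MAP_OBM_BYTES / MAP_BANKS;
}
```

Under these assumptions, 16 banks x 8 bytes x 150 MHz comes to 19.2 x 10^9 bytes per second, and each logical bank holds 4 MBytes (524,288 64-bit words). These are theoretical peaks, not measured numbers.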
Connections in and out of MAP are provided through the control chip as well as directly to the user logic chips. The control chip manages a pair of 7.2 GB/s DMA ports that connect to CPUs or other MAPs, Common Memories, or Hi-Bar Disks. Connections directly into the user logic chips are provided through GPIOX ports, which offer 12 GB/s of streaming data access to or from a variety of external data sources or targets. Bandwidths in, out, and through the components of the MAP are balanced to avoid bottlenecks. SRC-7 bandwidth numbers for the MAP board are shown in Fig 3, while Fig 4 and Fig 5 present the system architecture within which MAP exists, along with system bandwidth.
4. SRC-7 MAPstation.
5. Hi-Bar based systems.
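The link bandwidths quoted above lend themselves to a quick estimate of how long it takes to move a buffer to or from MAP. The following is a minimal sketch under stated assumptions: each DMA port is taken to sustain the full 7.2 GB/s on its own, GB/s is read as 10^9 bytes per second, and setup latency and protocol overhead are ignored. The helper name and constants are hypothetical.

```c
/* Quoted link rates in GB/s (read here as 10^9 bytes per second). */
#define DMA_PORT_GBPS 7.2   /* one DMA port, assumed to sustain the full rate */
#define GPIOX_GBPS    12.0  /* GPIOX streaming rate */

/* Idealized time in seconds to move `bytes` over a link running at
   `gbytes_per_s`. Ignores latency, setup cost, and contention. */
double transfer_seconds(double bytes, double gbytes_per_s)
{
    return bytes / (gbytes_per_s * 1.0e9);
}
```

For example, `transfer_seconds(64.0 * 1024 * 1024, DMA_PORT_GBPS)` gives roughly 9.3 ms to move 64 MBytes over a single DMA port, while the same call with `GPIOX_GBPS` gives roughly 5.6 ms. Real transfers will take longer once overheads are included.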
MAPs can connect directly to CPU motherboards or indirectly through switches, as depicted in the figures. The SNAP interface connects to a motherboard's CPU memory through a pair of DIMM slots.