Editor's Note: There are a lot of folks who are interested in accelerating their algorithms/programs written in C or C++. Many of these guys and gals are aware that FPGA-based accelerators are available, but they don't know how you actually make these little rascals perform their magic.
In order to address this, I contacted a number of the main players in this arena and asked each of them if they would be interested in penning an article that explained the process of:
- Writing a new application (or modifying a legacy application) in C or C++ in a form suitable for acceleration.
- Partitioning the application such that some portions will be compiled for use on the general-purpose processor and other portions will be implemented in the FPGA.
- Actually getting those portions of the application that are to be accelerated into the FPGA, by one means or another.
- Interfacing the main application running on the general-purpose processor with those portions running on the FPGA.
- Analyzing/profiling (debugging?) the new version of the program, part of which is running on the FPGA.
Some of the companies I contacted declined because they were too busy (which is a good place for them to be), but three initially stepped up to the plate, followed by a late Part-4 entry from Nallatech:
– Part 1: DRC Computer Corporation (this article)
– Part 2: SRC Computers
– Part 3: Impulse Accelerated Technologies
– Part 4: Nallatech
Most software today is written so that instructions are executed in sequence, and to speed up execution, programmers have typically pushed hardware designers to build processors with ever-higher clock rates. That has given rise to heavily pipelined processors operating at clock rates of 3 GHz and more. These processors also include architectural tricks such as large caches and out-of-order execution to get the most out of every clock cycle.
However, faster processors generate lots of heat, and today clock speeds have, for the most part, leveled off because the heat generated by faster circuits ends up constraining them. To continue the march toward ever-faster execution, hardware designers have switched from a single processor on a chip to dual, quad, and even more CPU cores on a single chip. The operating system can then allocate the processors to different applications, all running in parallel. The next level down from there is to find ways to parallelize the code running in each application and then run that parallelized code on multiple engines, either within the CPU or in a companion coprocessor that is optimized to execute that particular segment of parallelized code.
The latest generation of field-programmable gate arrays (FPGAs), together with the new open-standard Torrenza coprocessor interface defined by Advanced Micro Devices for Opteron system platforms and the Intel QuickAssist Technology Acceleration Abstraction Layer (AAL), provides designers with the hardware portion of the parallelization goal. By downloading configurations into an FPGA tied to either a Torrenza or QuickAssist motherboard platform, designers can accelerate computationally complex algorithms such as encryption, compression, search, and sort by up to 1000× over a general-purpose processor. DSP algorithms that need billions of integer or floating-point operations per second for image and audio processing, and much more, can also readily be accelerated by an FPGA-based coprocessor.
The challenge now becomes how to find the code within your current application that can be parallelized, and then how to parallelize that code so that it can be executed on an array of computational elements configured in an FPGA or ASIC. Where does one start with this analysis? One good starting point is to profile the code to find its computationally intensive portions, and then find ways to isolate that code so that it has few data dependencies. Once this code is isolated, you must find ways to optimize it so that it can be executed on the resources available on the coprocessor. Optimizing code to fit the architecture is difficult on architectures such as graphics processing units and the Cell processor, where users are not given complete details of the device's internals. FPGAs, on the other hand, allow you to define an optimized architecture on a case-by-case basis.
Next, examine the communication channels between the hardware and software. Make sure the hardware is not data-starved (that is, too little data reaches it to keep it continually performing computations), and at the same time make sure the hardware can keep up with the pace at which the software feeds it data. Finding the right balance is critical to smooth operation. If it takes longer to transmit the data to the accelerator than it takes the CPU to calculate the answer, make sure the hardware design uses every memory access as efficiently as possible, or perhaps even add another memory port to feed more data to the compute array.
For example, suppose you are accelerating a fast Fourier transform (FFT) followed by a phase shift and then an inverse FFT. In the software algorithm, you take the data out of memory, perform the FFT, and put the data back into memory. Then the data comes out of memory again, the algorithm executes the phase-shift computations, and the data goes back into memory one more time. Lastly, the inverse FFT retrieves the data, executes the algorithm, and then places the final result back in memory. With a few hardware and software tricks, data reuse can dramatically improve performance by eliminating these multiple store and access operations.
When performing numerical calculations, analyze the precision that you really need – you will probably find that you won't always need the full precision of single- or double-precision floating-point computations. A fixed-point multiplier and arithmetic unit usually requires less logic, so more elements can be configured on an FPGA. The more elements, the more computations per cycle and the faster the algorithm can execute. In FPGAs, integer or fixed-point math can often run 10 to 100 times faster than floating-point computations.
So where to start parallelizing the algorithms? First, perform a data-flow analysis to understand how data has to move between the different logic and computational elements. Next, perform a latency analysis to determine where potential bottlenecks may occur, and then find a balance between desired performance and the cost of implementing the design. To achieve a desired performance level, will you have to use PCI Express or HyperTransport to provide high-bandwidth, low-latency communication channels? Will you need a larger FPGA than initially planned to host more computational elements? How will the larger FPGA affect the cost of your solution?