The Blokus strategy game becomes the latest example of the use of hardware acceleration of an algorithm that is traditionally implemented in software.
High Level synthesis (HLS)
High Level synthesis tools convert applications written in a high-level language like C to Hardware Description Languages like Verilog and VHDL. Some of the advantages of using high level synthesis are as follows:
- It is easier to modify high-level code than HDL code; the application can thus be easily optimized
- Functional behavior can be verified through software simulation
- Applications developed in a high-level language can be easily ported to FPGAs
- A hardware testbench can be generated directly from the software testbench
- Development time is greatly reduced
This Blokus example used Impulse C to construct parallel computational blocks called processes which communicate using various interfaces. These communication interfaces, provided as library components with the tool, include streams, shared memory, semaphores, and messages.
The compiler also provides features, in the form of compiler pragmas, for fine-grained optimization of the design. This enables the developer to concentrate entirely on coarse grained optimization like load balancing amongst the various processes. Some of the fine grained optimization features used in this project are as follows:
- PIPELINE: This is used to pipeline loops where each iteration takes more than one clock cycle to complete
- UNROLL: This is used to compute each iteration of a loop in parallel
- FLATTEN: This is used to execute a large block of nested "if-else" statements in a single clock cycle
- There are other compiler constructs that utilize the benefits of dual ported RAM through multiple read write operations wherever possible
Various utilities provided with the development toolkit enable the designer to gauge the efficiency of the generated hardware from the high-level code. One such utility is the "Stage Master Explorer," which provides information regarding:
- Partitioning of code into combinational and sequential blocks
- Tentative delays of the combinational blocks (which affects the clock speed)
- Latency and throughput of compiler generated pipelines; any conflicts therein
As an illustration, consider the following code snippet:
The corresponding output is as illustrated below:
Stage Master Explorer output
(Click here to see a larger image.)
The performance of the current implementation was compared against a software solution executing the same algorithm. Following are the implementation details, performance results, and analysis:
- Algorithm: Minimax-based game tree search with alpha-beta pruning
- Software solution: Executes on single core of a 2.00GHz Intel Xeon processor
- FPGA implementation: Executes on a 50MHz Altera Cyclone IV EP4CE115F29C7 FPGA
When the game tree is searched to a depth of 2, the speedup achieved is around 11X. The speedup varies during the course of the game as the load pattern changes. The speedup mentioned above is one among the higher values achieved when there are maximum moves to search.
When the game tree is searched to a depth of 3, the speedup achieved is around 4X. The decrease in speedup can be attributed to the following reason. Pruning of the game tree search occurs best when the latest values of alpha and beta are used, which happens in case of the sequential software solution. However, due the distributed architecture of the FPGA implementation, each process communicates only with its parent process. This results in various processes using older values of alpha and beta, thus leading to inefficient pruning and redundant search overhead.
The code for this example is available for qualified researchers (please contact the author as discussed below).
Further acceleration is possible through algorithmic improvements to the minimax algorithm, including techniques like aspiration search, iterative deepening, and distributed tree search. Architectural improvements involve exploring communication and computation trade-offs.
About the author
Nilim Sarma is a Masters Candidate in the department of Electrical and Computer Engineering at Clemson University. He is a part of Dr. Melissa Smith's "Future Computing Technologies" research group. He is currently working on optimizing algorithms for hardware accelerators like FPGA and GPGPU. He has three years of industry experience in embedded systems programming. He can be reached via firstname.lastname@example.org.