Developers of embedded and high-performance systems are taking increased advantage of FPGAs for hardware-accelerated computing. FPGA computing platforms effectively bridge the gap between software programmable systems based on traditional microprocessors and systems based on custom hardware functions. Advances in design tools have made it easier to create hardware-accelerated applications directly from C language representations, but it is important to understand how to use these tools to the best advantage, and how decisions made during the design and programming of mixed hardware/software systems will impact overall performance.
This paper presents a brief overview of modern FPGA-based platforms and related software-to-hardware tools, then moves quickly into a set of examples showing how computationally-intensive algorithms can be written, analyzed and optimized for increased performance.
In recent years, FPGA-based programmable platforms have emerged as viable alternatives for many types of high-performance computing applications. The opportunities presented by these platforms include the rapid creation of custom hardware, simplified field updates and the reduction or elimination of custom chips from many categories of electronic products. As FPGAs have grown in logic capacity, their ability to host high-performance software algorithms and complete applications has grown correspondingly.
FPGA-based platforms range from individual FPGAs, with or without embedded soft/hard processor cores, to higher-performance FPGA-based computing platforms. The recent explosion in the use of FPGA embedded processors has proven that FPGAs can provide a flexible, powerful hardware platform for complete "systems-on-programmable-chips". FPGA vendors now provide, at little or no cost, all the processor and peripheral components needed to assemble a highly capable, single-chip computing platform. In addition to processors and common processor peripherals, such a platform can include one or more customized, highly parallel software/hardware accelerators.
The increased use of these FPGA-embedded soft or hard processors is particularly noteworthy. These processors are useful for a variety of reasons: they can run legacy code, including code planned for later acceleration in the FPGA fabric; they can serve during development as software test generators; they can replace custom hardware structures for such things as embedded state machines and standardized I/O; and they can host complete operating systems and perform non-critical computations that would be too space-intensive if implemented in hardware. When arranged as a grid, multiple soft processors can even form a parallel computing platform in and of themselves – one that is more generally programmable than an equivalent platform constructed entirely of low-level FPGA gates.
One example of such a platform is the Xilinx Virtex-4 FX device illustrated in Fig 1. The simplest of these devices (the FX-12) includes more than 12,000 programmable logic cells; an integrated PowerPC 405 core, which can operate at speeds as fast as 450 MHz; and dual 10/100/1000 Ethernet MACs. Larger versions of this same device provide substantially greater numbers of logic cells (as many as 142,000, along with four Ethernet MACs) as well as dual PowerPC processors and larger numbers of dedicated multiplier units.
1. The Virtex-4 FX-12 combines general-purpose programmable fabric with an embedded PowerPC 405 processor and a high-performance auxiliary processing unit (APU) interface, along with dual 10/100/1G EMACs. (Figure Courtesy Xilinx, Inc.)
The Virtex-4 FX family of devices provides an ideal platform for hardware acceleration of embedded applications due to its close coupling of the processor to the FPGA fabric. The CPU is directly coupled to the Auxiliary Processing Unit (APU) controller, which provides a high-bandwidth interface between the pipeline of the on-chip PowerPC and hardware accelerators implemented in the FPGA logic. Fabric co-processor modules (FCMs) implemented in the FPGA fabric can be connected to the embedded PowerPC processor through the APU interface, allowing the use of custom hardware accelerators. When combined with C-to-hardware compiler tools, the APU controller allows software programmers to create hardware-accelerated software applications with little or no FPGA design expertise.
Used in this way, FPGAs are excellent platforms for implementing coarse-grained heterogeneous parallelism. Compared to other models of machine parallelism, this approach requires less process-to-process communication overhead; if each process maintains its own local memory and has a clearly delineated task to perform, the application can easily be partitioned between different areas of the FPGA, perhaps including different clock domains, and between independent FPGA devices. There are many types of calculations that lend themselves quite naturally to coarse-grained parallelism, including vector/array processing, pipelined image processing, and multistage signal filtering.
The role of software-to-hardware tools
Software development tools, whether intended for deeply embedded systems or for enterprise applications, improve the application development process in two fundamental ways. First, a good set of tools provides an appropriate and easily understood abstraction of a target platform, whether that platform is an embedded processor, a desktop PC, or a supercomputer. A good abstraction of the platform allows software developers to create, test, and debug relatively portable applications while encouraging them to use programming methods that will result in the highest practical performance on the target platform.
The second fundamental value that tools provide is in the mechanical process of converting an application from its original high-level description, whether written in C or Java, as a dataflow diagram or in some other representation, into an optimized low-level equivalent that can be implemented (loaded and executed) on the target platform. Again, such a target platform might be a single-chip embedded system or it might be a large, general-purpose computing device.
In an ideal tool flow, the specific steps of this process would be of no concern to the programmer; the application would simply operate at its highest possible efficiency through the magic of automated tools. In practice this is rarely the case: any programmer seeking high performance must have at least a rudimentary understanding of how the optimization and code generation or mapping process works, and must exert some level of control over the process either by adjusting the flow (specifying compiler options, for example) or by revisiting the original application and optimizing at the algorithm level, or both.
To fulfill the dual role of tools as described above, tools for automated hardware generation must focus both on the automatic compilation/optimization problem and on delivering programming abstractions, or programming models, that make sense for FPGA-based programmable platforms. To be effective, any such tool must provide a software-oriented design experience.