Reconfigurable Computing: Custom Supercomputers on Demand?
Tom VanCourt, Altera
4/15/2008 11:17 PM EDT
Reconfigurable computing (RC) is one of the most flexible, adaptable computing technologies available, yet it is only slowly being adopted in blades and desktops around the world. RC creates an unprecedented opportunity for orders-of-magnitude improvement in GFLOPS per dollar, GFLOPS per watt, and raw GFLOPS.
It's no silver bullet, though. Realizing the potential of RC requires understanding the basic technology, then making sure that it's the right vehicle for the specific application.
Reconfigurable computing fabric
RC typically depends on field programmable gate arrays (FPGAs). For now, consider a high performance FPGA to be a bag of loose computer parts. Today's largest FPGAs include 500 or more block multipliers, on-chip RAM totaling a few MB, and pools of uncommitted arithmetic, control, and connectivity resources. RAM-based switches and lookup tables control connectivity and function, allowing easy redefinition of the computation.
An accelerator board attaches to an existing computer's main system interconnect, such as HyperTransport in an AMD system, NUMAlink in Silicon Graphics Altix systems, or PCI in a typical workstation. The board contains one or more FPGAs for application computing, plus some amount of SRAM and/or DRAM arranged in several independently addressable banks. The block diagram shown in Fig 1 is similar to that of a graphics accelerator, with a computing engine, on-board memory, and system interconnection.

Fig 1. FPGAs typically include low-latency on-board buffers, as well as access to system memory.
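The board organization described above can be mimicked in software to show how a host typically uses such an accelerator: copy data into one memory bank, let the FPGA compute into another, then read the result back. The following is only a minimal sketch. `MockAcceleratorBoard`, its methods, and the `invert_pixels` kernel are hypothetical names invented for illustration, not any vendor's API, and the "FPGA" here is an ordinary Python function.

```python
class MockAcceleratorBoard:
    """Software stand-in for an accelerator board (hypothetical, not a vendor API)."""

    def __init__(self, num_banks: int, bank_bytes: int):
        # Several independently addressable on-board memory banks.
        self.banks = [bytearray(bank_bytes) for _ in range(num_banks)]

    def write_bank(self, bank: int, data: bytes, offset: int = 0) -> None:
        """Host-to-board transfer over the system interconnect."""
        end = offset + len(data)
        if end > len(self.banks[bank]):
            raise ValueError("transfer exceeds bank size")
        self.banks[bank][offset:end] = data

    def read_bank(self, bank: int, length: int, offset: int = 0) -> bytes:
        """Board-to-host transfer."""
        return bytes(self.banks[bank][offset:offset + length])

    def run(self, kernel, src_bank: int, dst_bank: int, length: int) -> None:
        """Stand-in for the FPGA reading one bank and writing results to another."""
        result = kernel(self.banks[src_bank][:length])
        self.banks[dst_bank][:len(result)] = result


def invert_pixels(data):
    """Hypothetical kernel: invert 8-bit pixel values."""
    return bytes(255 - b for b in data)


board = MockAcceleratorBoard(num_banks=4, bank_bytes=1024)
board.write_bank(0, bytes([0, 10, 200, 255]))               # host -> bank 0
board.run(invert_pixels, src_bank=0, dst_bank=1, length=4)  # compute on the board
result = board.read_bank(1, 4)                              # bank 1 -> host
```

Keeping input and output in separate banks reflects the independently addressable banks mentioned above: on real hardware, that separation lets the compute engine read and write at the same time.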
Creating the computer
The von Neumann programming model distributes an algorithm across time: one function unit performs a sequence of operations, one at a time, to carry out a specific computation. Speed comes from performing many operations, including memory accesses, in rapid succession.
RC gets away from the von Neumann model; it distributes an algorithm spatially across the configurable computing fabric, as shown in Fig 2. Speed comes from performing tens to hundreds of operations in parallel, using pipelining, broadside parallelism, or a combination of both.

Fig 2. Programming an FPGA means configuring it into an application-specific processor.
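A rough cycle count illustrates the difference between distributing work across time and across space. The sketch below is a simplified model, not a measurement: it assumes every stage takes exactly one clock cycle, ignores memory stalls, and ignores the fact that FPGA clock rates are usually lower than CPU clock rates. The function names and the example numbers are invented for illustration.

```python
def sequential_cycles(num_items: int, num_stages: int) -> int:
    """One function unit performs every stage of every item, one step per cycle."""
    return num_items * num_stages


def pipelined_cycles(num_items: int, num_stages: int) -> int:
    """One hardware unit per stage: fill the pipeline once, then one result per cycle."""
    if num_items == 0:
        return 0
    return num_stages + (num_items - 1)


def broadside_cycles(num_items: int, num_stages: int, lanes: int) -> int:
    """`lanes` identical pipelines side by side, each taking a share of the items."""
    items_per_lane = -(-num_items // lanes)  # ceiling division
    return pipelined_cycles(items_per_lane, num_stages)


# Example workload: 1000 data items, each needing 8 operations.
print(sequential_cycles(1000, 8))    # 8000
print(pipelined_cycles(1000, 8))     # 1007
print(broadside_cycles(1000, 8, 4))  # 257
```

In this model the pipeline approaches one result per cycle regardless of depth, and replicating it across lanes divides the remaining time again, which is where the combined benefit of pipelining and broadside parallelism comes from.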
Programming an FPGA means implementing the control structure of an application as well as the data path in the reconfigurable fabric. Compilers exist for turning C or C-like languages into FPGA "bit files" or executable images: Handel-C from Agility Design Solutions, Mitrion-C from Mitrionics, and Impulse-C from Impulse Accelerated Technologies are just a few of the tools commercially available, and research tools exist in many commercial and academic labs.
High-level descriptions rarely exploit the full potential of an FPGA, however. The biggest reason is that C-like programming languages have sequential execution built deeply into their basic structure, making it extremely difficult to automate the extraction of FPGA-friendly parallelism.
Just as von Neumann programmers may fall back to assembly language for performance-critical kernels, FPGA programmers can use hardware description languages (HDLs) like Verilog or VHDL to expose more of the algorithm to the FPGA fabric. This step generally requires specialized skills. The responsibilities are the mirror image of those in C-like languages: there, execution is sequential by default and parallelism is the programmer's job, whereas HDLs are natively parallel and any sequential execution is largely up to the developer.
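The difference in default semantics can be illustrated with a small Python simulation. This is only a sketch of the idea, not HDL code: `clocked_step` loosely imitates registers that all update together on a clock edge (comparable to non-blocking assignments in Verilog), while `sequential_step` imitates ordinary C-like statement ordering. Both function names are made up for this example.

```python
def sequential_step(regs):
    """C-like semantics: each assignment sees the result of the previous one."""
    regs = dict(regs)
    regs["a"] = regs["b"]
    regs["b"] = regs["a"]  # reads the value just written, so the swap is lost
    return regs


def clocked_step(regs):
    """HDL-like semantics: sample all inputs first, then update all registers together."""
    sampled = dict(regs)
    return {"a": sampled["b"], "b": sampled["a"]}


start = {"a": 1, "b": 2}
print(sequential_step(start))  # {'a': 2, 'b': 2}
print(clocked_step(start))     # {'a': 2, 'b': 1}
```

The same two assignments swap the registers under clocked semantics but not under sequential semantics, which is why ordering has to be built explicitly (for example with a state machine) when an HDL design needs it.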
When considering the "grain" of a processing element (PE), x86-compatible and similar processors traditionally stand at one end of the spectrum: coarse-grained, with a single PE that is big and complex. The continuum runs through common dual cores, multi-cores on the order of ten PEs (such as the Cell Broadband Engine and the UltraSPARC T2 from Sun Microsystems), and many-cores on the order of 10^2 PEs (like Intel's Polaris or products from Clearspeed and Tilera).
As the number of PEs increases, the size and power of each PE decreases. FPGAs are sometimes considered the fine-grained extreme: on the order of 10^5 PEs of one-bit functionality, fixed at the time they are programmed. This, however, does not reflect how developers really program FPGAs.
Even in HDLs, design quanta are typically not individual bits of logic, but register arrays, RAM buffers, arithmetic units, or entire filters. As a result, any given FPGA can implement PEs of different sizes at different times, and occupy a different point on the axes of PE complexity vs. number of PEs per chip. In practice, FPGAs represent variable-grain computing, typically with 10 to 10^3 PEs custom tailored to the specific application.
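A back-of-the-envelope calculation shows how the same chip lands at different points on that axis. The sketch below is illustrative only: the figure of 500 block multipliers comes from the description earlier in this article, 2048 KB is an assumed stand-in for "a few MB" of on-chip RAM, and the three PE designs and their resource costs are hypothetical. Real designs are also limited by logic and routing resources that this model leaves out.

```python
def max_pes(budget, pe_cost):
    """Number of copies of one PE design that fit within every resource budget."""
    return min(budget[name] // cost for name, cost in pe_cost.items() if cost > 0)


# 500 block multipliers is from the article; 2048 KB is an assumed "few MB" of RAM.
BUDGET = {"multipliers": 500, "ram_kb": 2048}

# Hypothetical PE designs of increasing grain.
PE_DESIGNS = {
    "multiply-accumulate": {"multipliers": 1, "ram_kb": 4},
    "16-tap filter": {"multipliers": 16, "ram_kb": 8},
    "large kernel": {"multipliers": 32, "ram_kb": 128},
}

pe_counts = {name: max_pes(BUDGET, cost) for name, cost in PE_DESIGNS.items()}
print(pe_counts)
# {'multiply-accumulate': 500, '16-tap filter': 31, 'large kernel': 15}
```

Under these assumptions the same device holds anywhere from 15 to 500 PEs, depending only on how much work each PE is designed to do.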




Comments
VidExprt
6/6/2008 8:54 PM EDT
I have an application where I would like to process images in an Altera FPGA but the image data comes from a PC. I wonder if anyone makes a board with an FPGA that plugs into a PC which would allow me to write a PC program to transfer the image data to a memory on the FPGA card and then give the FPGA a command to start processing. When it completes, then the PC would need to be able to read the results back in. I know how to program an FPGA to process the data. The problem I have is how to transfer the data from a PC to the FPGA and how to transfer the results back again. I am hoping that there is a product that does this so I can just concentrate on the FPGA image processing algorithm. I know that SGI makes such a device but I would like to avoid going that route.
Tom VanCourt
6/15/2008 11:41 AM EDT
There are lots of products around. XtremeData builds accelerators that plug directly into HyperTransport sockets. Gidel and Annapolis Micro Systems build PCI boards, and there are probably lots of others. They all come with tools and IP for host communication.
Other boards don't sit in the system bus, but have USB, serial lines, or other IO that can let you exchange data with your PC host.
xdiisafpga
6/24/2008 6:14 PM EDT
Hello Dave;
XtremeData makes In-Socket Accelerators that utilize the direct link to the microprocessor (Intel/FSB or AMD/HT) and FPGA technology to accelerate algorithms to incredible rates. If you are interested in hearing more email me at:
jward@xtremedatainc.com
Thanks;
Joe Ward
XtremeDataInc
www.xtremedatainc.com
SOY
6/27/2008 9:25 AM EDT
Hi,
My question is why RC is still a niche technology.
One reason may be the development tools, which must handle:
1. Partitioning the application into a host program on the host processor and logic circuits on the reconfigurable fabric(s)
2. Scheduling and optimizing the communication among them, and among the individual software and hardware processing units
In addition, regarding RC and especially FPGAs:
3. They are small in scale compared with traditional von Neumann computers that support virtualization (e.g. virtual memory)
4. Application-specific datapath synthesis is now one of the hot topics, rather than how to harness and schedule the resources
What do you think about these points?
Tom VanCourt
8/16/2008 4:15 PM EDT
Soy -
These are great questions, and don't have any easy answers. Somewhat surprisingly, answers to the first three are much the same as when applications port to "big iron," like a Blue Gene with a quarter-million processors. I hope to discuss these issues and others, at least at a high level, in future articles.
I don't know of any real how-to books on the market, but "Reconfigurable Computing" by Hauck and deHon is pretty good.
SOY
8/24/2008 6:07 AM EDT
Dear VanCourt,
I am glad to see you again.
1. Partitioning the application: this is still a problem, and my question is why such a system is still necessary. I wonder why stand-alone, active (rather than passive) devices have not yet appeared. If we had such stand-alone devices, we would not need to spend time thinking about partitioning.
2. Scheduling/optimizing the communication: in that case we would not need to discuss this either; we would only need scheduling among the RC fabrics.
3. Virtualization: since virtualization was proposed in the WASMII project, few products have supported it, so why is the technique still a niche?
Yes, the first two points are problems on the BlueGene/L, but tools already support them. BG/L uses simulated annealing for placement, which is the same approach used in the FPGA/CPLD world.
>I don't know of any real how-to books on the market, but "Reconfigurable Computing" by Hauck and deHon is pretty good.
I have read the book. It is good for reading about each topic, but we cannot get a "total view" from it, because there is no system concept that is used consistently across all the topics.
Regards,
SOY
http://electron-nest.on.coocan.jp/