Today's systems architects have a tough enough job solving difficult architectural problems for applications like 40G line cards, HD video transcoding systems, and next-generation RADAR. The hardest part of the job, however, is drawing the line between hardware and software: keeping manpower requirements in the equation, keeping costs in check, and architecting a solution that can be built within the timeframe the market demands. This is all in a day's work for some, but a nearly impossible task for many.
Since software is generally cheaper to develop, and software engineers outnumber hardware engineers by almost an order of magnitude, software-based solutions on x86 platforms have arguably become the de facto standard for many platforms. Many of these platforms were (and still are) built from previous generations of CPUs with reusable software, and thus leveraged faster next-generation CPUs or multi-cores for their roadmap. That is, until now.
For 2008, the industry buzzword is "hardware acceleration". CPU vendors are integrating custom IP into their chips. AMD and Intel are creating ecosystems, named Torrenza and QuickAssist respectively, for third-party accelerators. GPU vendors are setting their sights on general-purpose functionality. Meanwhile, a host of other chip companies, too many to mention, are developing new products that target this High Performance Computing (HPC) market.
A little-known fact in the HPC community is that the embedded computing market has always solved its problems with a combination of CPUs and accelerators. Due to the different space, weight, power, and environmental requirements of High Performance Embedded Computing (HPEC), a large portion of those accelerators are implemented with Field Programmable Gate Arrays (FPGAs) from companies like Altera, Xilinx, and others.
As these two markets collide, some of the world's largest companies from HPC and HPEC (Intel, AMD, Altera, and Xilinx) are now working together to make x86 CPUs, with their own internal accelerators and easy-to-access external FPGA accelerators, the solution of choice for both HPC and HPEC. As CPU-bound, software-only solutions come to require the benefits of hardware acceleration to remain competitive, architects must figure out what to accelerate and how to make the price, performance, and power trade-offs that meet market requirements.
What do I accelerate?
In these examples, we will concentrate on FPGAs, but many of the concepts can be applied to other types of accelerators. A simple and very common starting point is to profile your C/C++ code, find the routines that consume most of your clock cycles, and focus your efforts to increase performance or remove bottlenecks there (a minimal timing sketch follows the list below). Hardware designers have done this for years and have found that the following types of applications make sense to run on FPGA hardware; each can typically be parallelized beyond a factor of 10x improvement:
- Filters – FIR, IIR, Poly-Phase
- Fast-Fourier Transforms (FFT)
- Encryption – AES, TDES, DES, etc.
- Video Transcoding – MPEG2, H.264, VC-1, and others
- Compression – ZLIB, GZIP, etc.
- Bioinformatics – Smith Waterman, BLAST, ClustalW
- Random Number Generation (RNG) – SOBOL and Mersenne Twister for Monte-Carlo
- Medical Imaging – CT Back Projection
- Packet and Network Processing (IPv6, Deep Packet Inspection)
- Market Data – FIX, FAST FIX, OPRA, etc.
- And more...
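As a concrete illustration of the profiling step described above, here is a minimal timing sketch in C. The fir_tap_loop routine is a hypothetical stand-in for a candidate kernel (a naive FIR filter); its name, workload, and sizes are illustrative and not from any real design, and in practice you would use a proper profiler such as gprof across the whole program rather than timing one routine by hand.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical hot routine standing in for a candidate kernel:
   a naive FIR filter tap loop. Name and sizes are illustrative. */
static double fir_tap_loop(const double *x, const double *h,
                           int n, int taps)
{
    double acc = 0.0;
    for (int i = taps; i < n; i++)
        for (int t = 0; t < taps; t++)
            acc += x[i - t] * h[t];
    return acc;
}

int main(void)
{
    enum { N = 1 << 16, TAPS = 64 };
    static double x[N], h[TAPS];

    for (int i = 0; i < N; i++)    x[i] = (double)(i % 97);
    for (int t = 0; t < TAPS; t++) h[t] = 1.0 / TAPS;

    /* Time the candidate routine; routines that dominate this kind
       of measurement are the first candidates for FPGA offload. */
    clock_t start = clock();
    double acc = fir_tap_loop(x, h, N, TAPS);
    clock_t end = clock();

    printf("checksum=%g, CPU time=%.3f s\n",
           acc, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}
```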
Another way to find these algorithms in your code is to think of them in one of two ways:
- Bit Level Processing with deep instruction pipelines.
- Vector Based Processing of large amounts of data.
Both approaches analyze data in ways that do not fit the standard 32-bit or 64-bit word, which creates overhead for a CPU or GPU because those devices have a fixed data and instruction set size. Additionally, many of these functions can have very deep instruction pipelines, which in an FPGA can be made as deep as necessary. For BLAST, for example, an FPGA can be programmed as a 3-bit machine, parallelized hundreds of times, with instruction pipelines over one hundred stages deep. The result is a single chip containing a 1,000-core "machine" running at 300 MHz, or 100x faster than a single 3 GHz CPU core.
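As a rough sanity check on that 100x figure (arithmetic ours, assuming full utilization and one result per core per cycle): 1,000 cores × 300 MHz = 300 billion operations per second, versus 1 core × 3 GHz = 3 billion, a ratio of exactly 100x.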
Another example is encryption, which consists largely of XOR functions at the bit level. These, again, can be created easily in an FPGA and parallelized thousands of times to build a machine that can encrypt data at 4 GB/s (bytes, not bits) using less than 50% of a large 65 nm FPGA.
Figure 1. Encryption is a good application for FPGA implementation.
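To make the bit-level structure concrete, the sketch below applies a keystream to a data block with 64-bit XORs in C. This is a minimal illustration, not a real cipher: the keystream is a placeholder constant. On a CPU the loop is bound by word width and memory bandwidth, while in an FPGA each XOR can be instantiated as its own gates and replicated thousands of times across the fabric.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal sketch of the XOR core of a stream cipher. A CPU performs
   these XORs one 64-bit word at a time; an FPGA can lay down
   thousands of such XORs side by side in hardware. */
static void xor_keystream(uint64_t *data, const uint64_t *keystream,
                          size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        data[i] ^= keystream[i];   /* one 64-bit XOR per iteration */
}

int main(void)
{
    uint64_t block[4]     = { 0x1122334455667788ULL, 2, 3, 4 };
    uint64_t keystream[4] = { 0xDEADBEEFCAFEF00DULL, 5, 6, 7 };

    xor_keystream(block, keystream, 4);   /* "encrypt" */
    xor_keystream(block, keystream, 4);   /* XOR again to "decrypt" */

    printf("round-trip word 0: 0x%016llx\n",
           (unsigned long long)block[0]);
    return 0;
}
```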
A secondary benefit of FPGA technology is substantial power savings over today's high-end CPUs and GPUs. The largest FPGA built on 65 nm technology consumes 25 to 30 watts, versus an x86 CPU that can run over 100 watts or, worse, a GPGPU that can easily run over 200 watts. Combine this power savings with any of the examples above and you get two orders of magnitude improvement on a performance/watt metric. That is hard to ignore in a design environment with a limited power budget, such as a UAV (Unmanned Aerial Vehicle), or in a financial data center with major power and cooling concerns.
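To put rough numbers on that claim (arithmetic ours, combining the figures above): a 100x speedup on a 25-watt FPGA versus a 100-watt CPU works out to 100 × (100 W / 25 W) = 400x on performance per watt, comfortably two orders of magnitude.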