EDA DesignLine Blog
Altera announces industry’s first OpenCL program for FPGAs
Clive Maxfield
11/15/2011 3:40 PM EST
The folks from Altera have just announced a development program focused on the Open Computing Language (OpenCL) standard for FPGAs and SoC FPGAs (see also "OpenCL gets upgrade, Altera tips FPGA tool").
What does this mean? What is OpenCL? Why do we care? Actually, there are so many implications here that this takes a little effort to wrap one’s brain around, but I’ll try to explain it and we’ll see how well I do…
Let’s start off with the fact that we all want more processing power. As an engineer I want as much processing performance “horse power” as I can get. The same applies to the folks performing multimedia processing (HD, VoD, 3D Video…); medical imaging (MRI, CT, PET…); high-performance computing (HPC) such as climate, financial, and fluid dynamic modeling; radar systems processing, and … the list goes on…
One way to increase processing power is frequency scaling, which basically means increasing the frequency of the CPU clock, but power considerations and physics limitations caused this approach to grind to a halt at around 3GHz circa 2003.
Another way to increase processing power is to increase the number of processor cores, which is why we now see CPUs containing dual-cores, quad-cores, and sometimes more.
Now, some algorithms are highly applicable to multi-core processing. In fact, some algorithms can benefit from having access to hundreds of processor cores, but where are we going to find hundreds of processor cores lying around? Well, by some strange quirk of fate, the graphics processing units (GPUs) found on today’s high-end graphics cards do, in fact, contain hundreds of processor cores.
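To put a rough number on "some algorithms," here is a minimal sketch based on Amdahl's law, which is my own choice of yardstick rather than anything from the announcement. The parallel fractions and core counts are made-up illustrative values, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: these fractions are illustrative, not measured.
    for fraction in (0.50, 0.95, 0.999):
        for cores in (4, 400):
            print(f"parallel={fraction:.3f} cores={cores:3d} "
                  f"speedup={amdahl_speedup(fraction, cores):6.1f}x")
```

The point of the sketch is that hundreds of cores only pay off when almost all of the work can run in parallel; a half-serial algorithm never even doubles its speed.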
The thing is that, a few years ago, some bright person came up with the idea of accessing the processing cores in the GPU and using them to perform non-graphical computing. At that time, circa 2006, these cores typically worked with fixed-point values and accessing them was non-trivial. Circa 2007/8 folks started providing APIs that gave easier access to the cores. Also, the GPUs themselves became much more sophisticated – today they contain hundreds of cores, each of which can support single- or double-precision floating-point calculations.
All of which leads us to OpenCL. Rather than my re-inventing the wheel here, let’s simply look to see what Wikipedia has to say:
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. It has been adopted by Intel, AMD, Nvidia, and ARM. OpenCL is an open standard defined by the Khronos Group.
OpenCL gives any application access to the graphics processing unit for non-graphical computing. Thus, OpenCL extends the power of the Graphics Processing Unit beyond graphics (general-purpose computing on graphics processing units). Academic researchers have investigated automatically compiling OpenCL programs into application-specific processors running on FPGAs,[1] and commercial FPGA vendors are developing tools to translate OpenCL to run on their FPGA devices.
OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. OpenCL is managed by the non-profit technology consortium Khronos Group.
To put this in a nutshell, the OpenCL standard is a C-based open standard for parallel programming. Note in particular the part that says “…execute on heterogeneous platforms consisting of CPUs, GPUs, and other processors.” The point is that, in addition to CPUs and GPUs, OpenCL can be compiled for use in FPGAs.
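For readers who have not seen the data-parallel style before, the following is a plain-Python emulation of the idea, not real OpenCL. In OpenCL C the kernel would be a function marked `__kernel` that finds its own index with `get_global_id(0)`; here the index is simply passed in as `gid`. The names `saxpy_kernel` and `run_ndrange` are hypothetical helpers invented for this sketch.

```python
from typing import Callable, List


def saxpy_kernel(gid: int, a: float, x: List[float], y: List[float],
                 out: List[float]) -> None:
    """Work done by ONE work-item; gid plays the role of get_global_id(0)."""
    out[gid] = a * x[gid] + y[gid]


def run_ndrange(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Hypothetical stand-in for enqueueing a kernel over a 1-D index space.

    A real device may run these work-items in any order or all at once;
    this emulation simply runs them one after another.
    """
    for gid in range(global_size):
        kernel(gid, *args)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(x)
    run_ndrange(saxpy_kernel, len(x), 2.0, x, y, out)
    print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because each work-item touches only its own element, nothing in the kernel cares whether it runs on a CPU core, a GPU core, or a pipeline built out of FPGA fabric, which is what makes the same source portable across all three.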
“So what,” you may say, “why not just use CPUs and GPUs?” Well, the thing is that FPGAs are actually really, REALLY efficient when it comes to running things in parallel using hardware algorithmic acceleration functions. In fact, using an FPGA you can get higher performance than a GPU while using only about 1/5 of the power, which is “nothing to sneeze at” as they say.
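As a quick piece of arithmetic on that claim, the sketch below turns "1/5 of the power" into a performance-per-watt ratio. It assumes, as a lower bound, that the FPGA merely matches the GPU's performance; the function name is hypothetical and no measured data is involved.

```python
def perf_per_watt_ratio(relative_performance: float, relative_power: float) -> float:
    """How many times better performance-per-watt is, relative to a baseline."""
    if relative_power <= 0:
        raise ValueError("relative_power must be positive")
    return relative_performance / relative_power


if __name__ == "__main__":
    # Figure quoted in the article: about 1/5 of the GPU's power.
    # Assumption: performance only equal to the GPU (1.0x), as a lower bound.
    lower_bound = perf_per_watt_ratio(1.0, 1 / 5)
    print(f"at least {lower_bound:.1f}x the performance per watt")
```

Any performance advantage over the GPU pushes the ratio above that lower bound.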
But I’m wandering off into the weeds again… Altera’s OpenCL program combines the parallel performance capability of FPGAs with the OpenCL standard to enable powerful system acceleration. This heterogeneous system (CPU plus FPGA using the OpenCL standard) also has a significant time-to-market advantage compared to traditional FPGA development using lower-level hardware description languages (HDLs) such as Verilog or VHDL.
Through its OpenCL program, Altera has engaged with multiple customers and expanded its university program to support the OpenCL standard for FPGA development in academia, and is actively contributing to the evolution of the OpenCL standard based on customer feedback. Early results of customer evaluations show a 35X performance increase compared to multicore CPU solutions, and a 50 percent reduction in development time compared to HDL-developed FPGA solutions.
Developed by an industry consortium called The Khronos Group, the OpenCL standard is an open, royalty-free standard that supports cross-platform, parallel programming of heterogeneous systems. As a standard parallel language, the OpenCL standard allows programmers to use a familiar C-based language to develop code across platforms, from CPUs to GPUs, and – now – expanding to FPGAs.
By adopting a heterogeneous architecture with OpenCL, system architects can maximize performance of algorithmic-intensive portions of their design while also achieving fast time-to-market. Target applications range from high-performance computing, including climate and financial modeling, to advanced radar systems, medical imaging, and video encoding and processing—any system that requires fast computations that can be parallelized.
The OpenCL standard offers a natural separation between “host” code—pure software, written in standard C/C++, that can be executed on any type of microprocessor—and the “kernel” code, written in OpenCL C, that runs on the accelerator. By profiling their algorithms, system architects can choose which functions to accelerate as kernels in the FPGA device to improve system performance. Multiple kernels can operate in parallel to further speed up processing. The host communicates with the accelerator device via a set of library routines with a minimal set of extensions that allow programmers to specify parallelism and memory hierarchy for the most computationally intensive portions of the code.
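The "profile first, then pick your kernels" step can be sketched as follows. This is a toy pipeline in plain Python, assuming three made-up stages (`load_samples`, `fir_filter`, `peak`); none of these names come from Altera's tools, and a real project would profile the actual host application instead.

```python
import time
from typing import Dict, List


def load_samples(n: int) -> List[float]:
    """Cheap stage: generate n input samples."""
    return [float(i % 17) for i in range(n)]


def fir_filter(samples: List[float], taps: List[float]) -> List[float]:
    """Expensive stage: every output is an independent dot product."""
    k = len(taps)
    return [
        sum(taps[j] * samples[i + j] for j in range(k))
        for i in range(len(samples) - k + 1)
    ]


def peak(values: List[float]) -> float:
    """Cheap stage: reduce the filtered signal to one number."""
    return max(values)


def profile_pipeline(n: int = 20_000, num_taps: int = 64) -> Dict[str, float]:
    """Run each stage once and return its wall-clock time in seconds."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    samples = load_samples(n)
    timings["load_samples"] = time.perf_counter() - start

    taps = [1.0 / num_taps] * num_taps
    start = time.perf_counter()
    filtered = fir_filter(samples, taps)
    timings["fir_filter"] = time.perf_counter() - start

    start = time.perf_counter()
    peak(filtered)
    timings["peak"] = time.perf_counter() - start
    return timings


def kernel_candidate(timings: Dict[str, float]) -> str:
    """Pick the stage that dominates run time as the first one to offload."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    timings = profile_pipeline()
    for name, seconds in timings.items():
        print(f"{name:12s} {seconds * 1e3:8.2f} ms")
    print("offload first:", kernel_candidate(timings))
```

In this toy case the filter dominates, and since each of its outputs is independent of the others it is also the stage that maps most naturally onto a kernel; the cheap stages would stay in host code.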
Visit www.altera.com/OpenCL for more information on Altera’s OpenCL program, including a whitepaper and online learning materials, and also to register for updates. For more information on the OpenCL standard, visit www.khronos.org/opencl.
Max the Magnificent
11/15/2011 3:54 PM EST
One thing that makes this announcement particularly interesting is Altera's Fused Datapath technology, which lets them implement single- and double-precision floating-point operations very efficiently...
Dr DSP
12/8/2011 6:49 PM EST
Double precision floating point efficiency will turn out to be the big issue. This is what most HPC folks are looking for and it is not that efficient in FPGAs. (Maybe Altera has or will publish some benchmarks to provide some evidence to the contrary?)
If DP computation is improved, then the issue is getting data on and off chip efficiently (from high-speed memory) for things like sparse matrix computations.
If someone can show a real world example that addresses these issues then I can be convinced. Otherwise it's just marketing fluff (IMHO).
StefanMohl
12/9/2011 2:58 PM EST
FPGAs are actually very much better than standard CPUs for low-latency access and memory bandwidth. The main reason is that FPGAs have large numbers of internal parallel memory banks that can be manually controlled.
CPUs only have automatically controlled caches in sequential "waterfall" levels. On a CPU, you are usually forced to read a cache-line of contiguous data from memory at a time, and hope that the algorithm's access pattern fits the associativity of the cache you happen to have. With FPGAs, you can stage your data manually and with high precision into thousands of simultaneously accessible memory regions, giving you literally multi-terabyte-per-second (not giga, tera!) bandwidth to your data.
Also, you are not limited to reading bursts of a cache-line in size, rather you can read as much as you really need. Often, FPGA-cards also have several attached SRAM memory banks for fast-access off-chip storage, again improving on the von Neumann bottleneck.
The key words in all this are "large numbers of", "several", "thousands of", "simultaneously accessible", and so on. The whole point is that the FPGA is originally parallel, in contrast to CPUs that are originally sequential. That means that everything about the FPGA is based on parallel and multiple access (along with full manual control of data staging), and that really helps _a_lot_ when having tricky memory access problems!
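As a back-of-the-envelope check on the terabyte figure in this comment, the sketch below multiplies out banks, access width, and clock rate, and also shows how much of a cache line is wasted when only one word is needed. Every number here (2000 banks, 4 bytes per access, 250 MHz, a 64-byte cache line) is an illustrative assumption, not the specification of any particular FPGA or CPU, and the function names are hypothetical.

```python
def aggregate_bandwidth_bytes_per_s(banks: int, bytes_per_access: int,
                                    clock_hz: float) -> float:
    """Peak bandwidth if every bank delivers one access per clock cycle."""
    return banks * bytes_per_access * clock_hz


def useful_fraction(bytes_needed: int, cache_line_bytes: int) -> float:
    """Share of a fetched cache line that the algorithm actually uses."""
    return min(bytes_needed, cache_line_bytes) / cache_line_bytes


if __name__ == "__main__":
    # Illustrative assumptions, not the specs of any particular device.
    banks = 2000              # independently addressable on-chip memory blocks
    bytes_per_access = 4      # one 32-bit word per bank per cycle
    clock_hz = 250e6          # 250 MHz fabric clock

    total = aggregate_bandwidth_bytes_per_s(banks, bytes_per_access, clock_hz)
    print(f"on-chip peak: {total / 1e12:.1f} TB/s")  # 2.0 TB/s

    # A CPU that needs one 4-byte word but must fetch a 64-byte cache line.
    print(f"useful share of cache line: {useful_fraction(4, 64):.4f}")
```

Under these assumptions the aggregate on-chip figure does land in the terabytes-per-second range, but only if the design really keeps every bank busy on every cycle.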