AMD's CTO talks heterogeneous systems architecture
Sylvie Barak
1/30/2012 9:08 PM EST
Not a replacement for OpenCL
HSA, said Macri, was also not a replacement for OpenCL. Rather, it would be an optimized platform architecture for OpenCL. “If you want to write OpenCL, this will be the hardware to run it better,” he said.
Indeed, he said, using OpenCL on HSA would avoid wasteful copies, offer low-latency dispatch, improve the memory model and allow pointers to be shared between CPU and GPU.
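The difference between copying data to a device and sharing a pointer with it can be illustrated with a toy sketch. This is not HSA or OpenCL code: plain Python buffers stand in for CPU and GPU memory, and the function names `scale_with_copies` and `scale_shared` are hypothetical, chosen only for this illustration.

```python
# Illustrative only: ordinary Python buffers stand in for host and device memory.

def scale_with_copies(host_data, factor):
    """Copy-in / copy-out model: the 'device' only ever sees its own buffer."""
    bytes_copied = 0
    device_buffer = bytearray(host_data)      # copy host -> device
    bytes_copied += len(device_buffer)
    for i in range(len(device_buffer)):       # the 'kernel' runs on the device copy
        device_buffer[i] = (device_buffer[i] * factor) % 256
    result = bytearray(device_buffer)         # copy device -> host
    bytes_copied += len(result)
    return result, bytes_copied


def scale_shared(shared, factor):
    """Shared-pointer model: the 'device' updates the caller's memory in place."""
    for i in range(len(shared)):
        shared[i] = (shared[i] * factor) % 256
    return 0                                  # nothing was copied
```

In the first function the data crosses the host/device boundary twice, so the copy traffic grows with the size of the buffer. In the second, the caller passes a `memoryview` of its own buffer and the result appears in place, which is the kind of saving the shared-pointer claim describes.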
“HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance,” said Macri, adding that optimized libraries could choose to use the lower-level interface.
Today’s command and dispatch flow has too many steps and processes, said Macri, adding that it was a waste to have so much overhead just to get something to execute.
With HSA, he said, applications could simply place things directly into the hardware queue without the need for all those extraneous drivers. “No APIs to deal with, no kernel mode drivers, no soft queues. Just direct access to the hardware,” he explained.
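The article does not describe the queue format, so the following is only a toy model of the general idea: an application writes work packets straight into a ring buffer and signals the consumer, with no driver call in between. The names `DispatchPacket`, `UserModeQueue`, `ring_doorbell` and `drain` are assumptions made for this sketch, and the "device" is simulated by an ordinary method call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DispatchPacket:
    """Hypothetical work descriptor: which kernel to run and with what arguments."""
    kernel: Callable[..., object]
    args: tuple


class UserModeQueue:
    """Fixed-size ring buffer shared by an application and a simulated device."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[DispatchPacket]] = [None] * size
        self.write_index = 0   # advanced by the application
        self.read_index = 0    # advanced by the simulated device
        self.doorbell = 0      # last write index announced to the device

    def enqueue(self, packet: DispatchPacket) -> bool:
        """Application side: write the packet into the next free slot."""
        if self.write_index - self.read_index == len(self.slots):
            return False       # queue full; the caller must retry later
        self.slots[self.write_index % len(self.slots)] = packet
        self.write_index += 1
        return True

    def ring_doorbell(self) -> None:
        """Application side: tell the device how far it may read."""
        self.doorbell = self.write_index

    def drain(self) -> list:
        """Device side: run every packet published via the doorbell."""
        results = []
        while self.read_index < self.doorbell:
            slot = self.read_index % len(self.slots)
            packet = self.slots[slot]
            results.append(packet.kernel(*packet.args))
            self.slots[slot] = None
            self.read_index += 1
        return results
```

The application's only steps are a slot write and a doorbell update. Packets that have been enqueued but not yet announced through the doorbell stay invisible to the consumer, which lets the producer batch several packets before signalling.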


The bottom line, said Macri, was that it was important to switch the compute, not move the data. With every processor now combining serial and parallel cores, every core should be capable of running at different levels of performance and be easily programmable. The architecture needed to easily support massive data sets and task-based programming models, while remaining open to all.

“The architectural path for the future is clear,” Macri declared. That path would be paved with the programming patterns established on Symmetric Multi-Processor (SMP) systems migrating to the heterogeneous world. The architecture would be open, with published specifications and an open-source execution software stack, and heterogeneous cores would be able to work together seamlessly in coherent memory, with low-latency dispatch and no software fault lines.
That future, according to Macri, could not come around soon enough.
For EE Times' full coverage of DesignCon, please visit here.


prabhakar_deosthali
1/31/2012 1:44 AM EST
In earlier-generation computers there was a concept of bit-sliced processors and hardware time slicing. With this, a single-CPU computer worked like a multi-core processor, and software developers could take advantage of the feature to write parallel programming applications, with the required synchronization done at some hardware buffers.
It looks like a similar thing is appearing in a new avatar in these latest multi-core CPUs.
goafrit
1/31/2012 11:18 AM EST
ARM is a very innovative company that understands the model of the next industrial business. Focusing on building the basis and depending on others to plug and play will make them remain lean with capacity to adjust to market needs.
xorbit
1/31/2012 2:21 PM EST
You mean AMD?
dirk.bruere
1/31/2012 1:53 PM EST
This has been a research topic for the past 30 years. No doubt they will be reinventing the same wheels.
NSK
1/31/2012 2:42 PM EST
I don't understand the new direct-hardware-access model. It seems like a shared hardware resource is still going to need layering somewhere to assure ownership by one process at a time. Is this task somehow being pushed out to the hardware so that it looks transparent to the caller?
wmgervasi
1/31/2012 2:42 PM EST
Remember that floating point math started out as a coprocessor to the x86 architecture before being integrated; in fact, I imagine that it still has an "escape sequence" in the binary to invoke the coprocessor function. If AMD is using a similar path for the future, it seems like a logical extension of an x86 feature that has been around since dinosaurs walked the earth.
MikeSmith2011
1/31/2012 4:35 PM EST
How is this different from the GPGPU concept? nVidia has been at it for quite some time with CUDA and has had success in a very limited set of applications, such as oil and gas exploration.
I don't see what the innovation here is.
melonakos
1/31/2012 10:14 PM EST
Sounds a lot like ArrayFire (which has both OpenCL and CUDA support), http://accelereyes.com/arrayfire
Hasmon
2/2/2012 9:41 AM EST
There is some historical inertia in our whole approach to programming models.
In the 1970s memory speeds were faster than CPU clock speeds (RAM access was on the order of 100 ns, but CPU instructions on a 1 MHz clock took microseconds to execute). So programming languages took care to optimize arithmetic operations but could get away with *ignoring* memory completely...since memory accesses took place almost instantly from the processor's point of view. So C does not distinguish between fast and slow memory...all pointers are equivalent. If there is a delay in accessing memory, the language makes no provision for how to reduce that latency...it does not even explicitly acknowledge that as a possibility. All programming languages today have this bias towards ignoring memory I/O, as a legacy from the popular languages of the 1970s.
Since then CPU speeds have gone up by several orders of magnitude but memory speeds have gone up far less. And so hardware designers have used memory caches to try to manage memory invisibly to the programmer...and to keep software running in a bubble where the conditions of the 1970s are imperfectly replicated, where memory accesses appear nearly instantaneous.
Since the bottleneck in CPUs and GPUs is now memory I/O, a new type of language is needed which, at the least, allows the programmer to explicitly make a distinction between the various layers in the memory hierarchy, rather than in the kludgy way it's handled right now. Something like http://sequoia.stanford.edu/