There is some historical inertia in our whole approach to programming models.
In the 1970s memory was fast relative to the CPU: RAM access was on the order of 100 ns, while instructions on a 1 MHz clock took microseconds to execute. So programming languages took care to optimize arithmetic operations but could get away with *ignoring* memory completely, since memory accesses were almost instantaneous from the processor's point of view. So C does not distinguish between fast and slow memory; all pointers are equivalent. If there is a delay in accessing memory, the language makes no provision for reducing that latency, and it does not even explicitly acknowledge the possibility. Nearly all programming languages today share this bias towards ignoring memory I/O, a legacy of the popular languages of the 1970s.
Since then CPU speeds have gone up by several orders of magnitude, but memory latency has improved only modestly. So hardware designers have used memory caches to try to manage memory invisibly to the programmer, and to keep software running in a bubble where the conditions of the 1970s are imperfectly replicated, where memory accesses appear to be effectively instantaneous.
Since the bottleneck in CPUs and GPUs is now memory I/O, a new type of language is needed which, at the least, lets the programmer explicitly distinguish between the various layers of the memory hierarchy, rather than relying on the kludgy way it's handled right now. Something like Sequoia: http://sequoia.stanford.edu/
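To make the "kludgy" part concrete, here is a rough C sketch of what hand-managing the cache looks like today. It is my own illustration, not taken from Sequoia: the function names and the TILE value are assumptions, and the right tile size depends on the machine. Both functions compute the same matrix transpose; the second just reorders the loops into blocks so that the data being touched has a better chance of staying in cache. Nothing in the language says that this is the purpose of the extra loops.

```c
#include <assert.h>
#include <stddef.h>

/* TILE is an assumed block edge. The idea is that two TILE x TILE blocks of
   doubles should fit in the L1 cache; the best value is machine-dependent. */
#define TILE 32

/* Naive out-of-place transpose of an n x n row-major matrix.
   dst is written with a stride of n elements, so for large n almost every
   write lands on a different cache line. */
void transpose_naive(size_t n, const double *src, double *dst)
{
    assert(src != dst); /* out-of-place only */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Same result, but processed one TILE x TILE block at a time so the source
   and destination blocks are more likely to stay cache-resident. */
void transpose_tiled(size_t n, const double *src, double *dst)
{
    assert(src != dst); /* out-of-place only */
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block bounds when n is not a multiple of TILE. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The pointers passed to both versions are ordinary C pointers; the language has no way to say "this block lives in fast memory", so the intent is buried in loop bounds and a magic constant.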
How is this different from the GPGPU concept? nVidia has been at it for quite some time with CUDA and has had success in only a very limited set of applications: oil and gas exploration, etc.
I don't see what the innovation here is.
Remember that floating-point math started out as a separate coprocessor (the x87) alongside the x86 before being integrated; in fact, I imagine that it still has an "escape sequence" in the binary to invoke the coprocessor function. If AMD is using a similar path for the future, it seems like a logical extension of an x86 feature that has been around since dinosaurs walked the earth.
I don't understand the new direct-hardware-access model. It seems like a shared hardware resource is still going to need layering somewhere to ensure ownership by one process at a time. Is this task somehow being pushed out to the hardware so that it looks transparent to the caller?
ARM is a very innovative company that understands the business model the industry is moving towards. Focusing on building the foundation and depending on others to plug and play will keep them lean, with the capacity to adjust to market needs.
In earlier-generation computers there was a concept of bit-sliced processors and hardware time slicing. With this, a single-CPU computer worked like a multi-core processor, and software developers could take advantage of the feature to write parallel programming applications, with the required synchronization done at some hardware buffers.
Looks like a similar thing is appearing in a new avatar in these latest multi-core CPUs.