I know I will get myself into some trouble with this blog, not because I mean to offend anyone, but because I am getting on the edge of something I don’t fully understand and thus will probably open myself up for people telling me I am full of it, or that I have it all wrong. You know – I really don’t mind and it is part of the reason for the blog, just keep it polite.
OK, so what is going on in my mind? General purpose processors are a less than ideal structure not just because of their general inefficiency, but because of the memory structures they are connected to. Going to memory is expensive. I know that for a long time processor speeds were increasing faster than memory access times, and I am sure that has continued, although at a slower rate. Now, in going to multiple processing cores, we only speed things up when the bandwidth to memory is increased, and this is basically achieved by giving each processor its own cache, which is not only faster and more expensive memory but also avoids having to go off chip.
DSPs added a new twist: they fetched memory contents using multiple buses, a technique that was later copied in some processors, and they accessed memory in words wider than the internal data width so that one fetch would obtain multiple instructions or pieces of data. Pre-fetch attempted to use any memory access dead time to do useful fetches in the hope that they would be used, and DMA cut the overhead of moving data by transferring whole blocks of memory rather than one item at a time.
When we start to think of accelerators for processors, the biggest impediment to achieving significant speedups is the memory accesses. In other words, the things that can be accelerated the most are those in which lots of processing is performed on a few pieces of data. I also remember sitting in a seminar given by Tilera many years ago that talked about how communications between processors used to be expensive, because of the memory accesses, and that with a many core architecture where processors could talk to each other directly, communication would become cheaper than computation. I never really managed to get my head around what that meant in programming terms.
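To put a rough number on "lots of processing on a few pieces of data", here is a minimal sketch that counts operations per word moved for two textbook kernels. The function names are my own, and the counts assume the ideal case where every word crosses the memory interface exactly once, which real hardware rarely achieves.

```python
def ops_per_word(ops, words_moved):
    """Operations performed per word moved to or from memory."""
    return ops / words_moved


def vector_add_intensity(n):
    # n additions; reads 2n input words and writes n result words.
    return ops_per_word(n, 3 * n)


def matmul_intensity(n):
    # n^3 multiplies plus n^3 adds for an n x n matrix product.
    # Assumption: each of the 3 * n^2 matrix words is moved exactly once.
    return ops_per_word(2 * n**3, 3 * n**2)


if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(n, vector_add_intensity(n), matmul_intensity(n))
```

Under those assumptions the vector add never gets above a third of an operation per word however large it grows, while the matrix multiply gets better as it grows. The second kind of kernel is the one worth handing to an accelerator.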
But what I can get my head around are the problems associated with building FPGA accelerators that sit on the side of a processor. It is a nice concept – use the FPGA to accelerate the computational pieces of the code, and then you may be able to downsize the processor or use it for additional tasks, and the whole application runs faster. But having worked on this kind of problem in the past, I know it is still constrained in many cases by memory. The accelerator still has to get data from the memory and put it back in the memory, and sometimes the data it uses is sitting in the processor's memory or cache and so has to be flushed to memory first. With some algorithms, even using advanced memory managers and DMA, there was barely any overall speedup even though the accelerator ran 10X faster than the processor (figure totally made up).
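The arithmetic behind that disappointment is easy to sketch. The model below is my own simplification, not a measurement: it assumes the data movement (cache flush plus transfers in both directions) cannot overlap with the computation, and that the processor on its own pays no such cost. The function name and all the numbers are made up for illustration.

```python
def overall_speedup(compute_s, transfer_s, accel_factor):
    """Whole-task speedup when the computation is offloaded.

    compute_s:    time the processor needs for the computation alone
    transfer_s:   extra time spent flushing caches and moving data
                  to and from the accelerator (assumed not to overlap)
    accel_factor: how many times faster the accelerator computes
    """
    baseline = compute_s
    offloaded = transfer_s + compute_s / accel_factor
    return baseline / offloaded


if __name__ == "__main__":
    # Illustrative figures only: 1 second of compute, a 10X accelerator.
    for transfer in (0.0, 0.1, 0.5, 0.9):
        speedup = overall_speedup(1.0, transfer, 10.0)
        print(f"transfer={transfer:.1f}s -> {speedup:.2f}x overall")
```

In this toy model, once the data movement costs 90% of the original compute time, the 10X accelerator delivers no overall gain at all.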
I can even relate this to high-level synthesis. The designs that work best are the ones that operate on a block of data that is accessed in a regular manner and return the results to memory. When “random” accesses to memory are made, it becomes too difficult to work out the data dependencies. Most of the time you spend doing architectural investigation in these tools involves sizing and shaping the memories and the ways in which they are accessed so that things can be properly pipelined. In hardware we have the luxury of fully custom memories and access mechanisms, which are unavailable to the general processor world.
So it would seem that our world is constrained by memories, and yet there have been few advancements in this technology apart from a slow and steady improvement in access times and capacities. Where are the memory optimization tools that tell us our application would go faster if we rearranged the data in a particular way, or the processors with different memory architectures that are optimized for certain kinds of tasks, or the algorithms that are designed to minimize data access?
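As a hint of what such a tool might report, here is a toy sketch that counts misses in a very small direct-mapped cache for two ways of walking the same row-major array. Everything in it (the cache geometry, the array size, the function names) is an assumption I picked for illustration, and a real cache is considerably more forgiving than this one.

```python
def count_misses(addresses, line_words=8, num_lines=64):
    """Count misses in a toy direct-mapped cache.

    addresses:  iterable of word addresses, in access order
    line_words: words per cache line (assumed)
    num_lines:  number of lines in the cache (assumed)
    """
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_words   # which memory line holds this word
        slot = line % num_lines     # the only cache slot it may occupy
        if tags[slot] != line:
            tags[slot] = line       # evict whatever was there
            misses += 1
    return misses


def row_major_order(rows, cols):
    """Walk a row-major array along its rows (consecutive addresses)."""
    return (r * cols + c for r in range(rows) for c in range(cols))


def column_major_order(rows, cols):
    """Walk the same array down its columns (stride of cols words)."""
    return (r * cols + c for c in range(cols) for r in range(rows))


if __name__ == "__main__":
    print("row order misses:   ", count_misses(row_major_order(256, 256)))
    print("column order misses:", count_misses(column_major_order(256, 256)))
```

With these particular numbers, walking along the rows misses once per cache line, while walking down the columns misses on every single access. It is the same data and the same amount of work, and only the order of access has changed.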
With so much memory now available we seem to be heading in the opposite direction, using memory in a willy-nilly fashion, when in reality it is our biggest roadblock.
When I was at university (in the '80s), I am sure I remember some research into putting processing in memory to solve this bottleneck. I guess that did not come to anything, although specialist items such as content-addressable memory (CAM) do a little of this. The nearest thing is probably the distributed RAM in FPGAs, which allows processing to be integrated with the memory.