Having been in the hardware acceleration business for over a decade, I completely understand the concept of moving data from one algorithm processor to the next one in a programmable way. There are some huge problems with this concept that have to be considered - and will take many years to resolve...
First, many of the scheduling operations that are currently done within the OS will need to move to hardware. For example, if one algorithm processor is busy and 3 others want to feed their results to it, then the resource needs to be scheduled. Priorities need to be considered, and queues need to be created to prevent the processors vying for the downstream resource from being stalled. Figuring out how to do this in an efficient way will take a lot of R&D. Otherwise, all the memory accesses we are trying to reduce will be spent dumping and retrieving data from these queues.
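To make the scheduling problem concrete, here is a minimal software sketch of the arbitration described above: several producers feed one busy downstream processor through a small priority queue, and a producer stalls when the queue is full. The class name `DownstreamArbiter`, the producer names, and the queue capacity are all made up for illustration; real hardware would do this in logic rather than in Python.

```python
import heapq
from itertools import count


class DownstreamArbiter:
    """Hypothetical arbiter in front of one shared algorithm processor."""

    def __init__(self, capacity):
        self.capacity = capacity   # number of queue slots (assumed, e.g. on-chip SRAM)
        self._heap = []            # entries: (priority, arrival order, producer, payload)
        self._order = count()      # tie-breaker keeps equal priorities first-in, first-out

    def offer(self, producer, payload, priority):
        """Return True if the result was queued, False if the producer must stall."""
        if len(self._heap) >= self.capacity:
            return False
        heapq.heappush(self._heap, (priority, next(self._order), producer, payload))
        return True

    def next_job(self):
        """Pop the most urgent job (lowest priority number), or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, producer, payload = heapq.heappop(self._heap)
        return producer, payload


arbiter = DownstreamArbiter(capacity=2)
results = [
    arbiter.offer("fft", "block-0", priority=2),
    arbiter.offer("filter", "block-1", priority=0),
    arbiter.offer("codec", "block-2", priority=1),  # queue is full, so this producer stalls
]
served = [arbiter.next_job(), arbiter.next_job(), arbiter.next_job()]
```

With only two slots, the third producer stalls even though it outranks one of the queued jobs, which is exactly the kind of trade-off between queue memory and stalls that would need the R&D.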
While it is great in concept to have a bunch of hardware legos that we can connect in any fashion, the reality is that connecting two legos in a unique way means doing a new silicon spin, which takes months to complete. If we use FPGAs to shorten the turn-around time, then we've defeated the goal of reducing power.
Finally, there are a lot more software engineers than hardware architects. Training up a new breed of developers is going to take a long time.
So, while I embrace the idea, I'm a bit skeptical about the economics of actually making it happen. We've been trying to get rid of the basic von Neumann machine for a long time, and I've seen very little headway toward making this happen...
Other than marketing bragging rights, what is the optimal number of CPU cores to maximize performance and minimize energy use? Other than image processing (where parallel processing has obvious benefits), most computer tasks consist of many interacting threads. Parallel multi-processor algorithms are still in their infancy and gridlock conditions are becoming increasingly familiar to computer users. Can certain tasks be dedicated to an available core to distribute the work? Could a core be reserved for the user so there is some level of responsiveness in the computer? It seems to me that today "background" tasks like scanning disks and file backup have a way of completely taking over the computer and blocking the most basic user operations. Obviously the task prioritization algorithms are failing miserably. That said, the prospects for multiple cores to be coordinated effectively are not encouraging.
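On the question of reserving a core for the user: as a rough sketch, a background process on Linux can restrict itself to a subset of cores with `os.sched_setaffinity`, leaving the rest free for interactive work. The helper names `background_cpus` and `confine_to_background`, and the choice of core 0 as the reserved core, are assumptions for illustration. This only addresses CPU contention and does nothing about contention for the disk.

```python
import os


def background_cpus(all_cpus, reserved):
    """Cores a background task may use: everything except the reserved ones.

    Falls back to all cores if reserving would leave nothing to run on.
    """
    remaining = set(all_cpus) - set(reserved)
    return remaining or set(all_cpus)


def confine_to_background(reserved=(0,)):
    """Pin the calling process away from the reserved cores (Linux only).

    Returns the set of allowed cores, or None where the affinity API is unavailable.
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    allowed = background_cpus(os.sched_getaffinity(0), reserved)
    os.sched_setaffinity(0, allowed)  # pid 0 means the calling process
    return allowed
```

A disk scanner or backup job would call `confine_to_background()` once at startup, before doing any work.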
The problem lies in software, i.e. our inability to program multi-core operation efficiently...if you look at the human brain it is highly parallel...although I am not sure how many cores I have under my skull ;-)...Kris
Those "background" tasks you refer to hurt user responsiveness because the bottleneck is in your input/output (IO) operations. Spinning hard drives have extremely non-linear performance under multiple concurrent accesses. It is very difficult to predict that performance hit because it depends on so many factors (data position, number of disk resources being accessed "simultaneously", the disk's internal algorithm, energy class, etc). The solution is to get a more predictable IO device like an SSD. I have a couple of them (SATA and PCIe based) and there is no turning back.
And being extremely conservative about using disk resources from background processes is not going to help, because reading two 4KB chunks of data that are far apart kills disk performance in a way the end user notices. Perhaps if applications could have access to latency or command queue depth metrics they would know better when the system is really idling.
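As a rough sketch of that last idea, a background task could hold off until a window of recent samples shows an empty command queue and low latency. The `IoSample` record, the `disk_is_idle` helper, and the thresholds below are hypothetical; how an application would actually obtain these metrics depends on the OS and is not shown.

```python
from dataclasses import dataclass


@dataclass
class IoSample:
    queue_depth: int    # outstanding commands at the device (hypothetical metric)
    latency_ms: float   # completion latency of the most recent command


def disk_is_idle(samples, max_depth=0, max_latency_ms=5.0):
    """Treat the disk as idle only if every recent sample is quiet."""
    if not samples:
        return False    # no data yet: be conservative and do not start background work
    return all(
        s.queue_depth <= max_depth and s.latency_ms <= max_latency_ms
        for s in samples
    )


quiet = [IoSample(0, 1.2), IoSample(0, 0.9), IoSample(0, 1.1)]
busy = [IoSample(0, 1.0), IoSample(3, 14.5), IoSample(1, 9.0)]
```

Requiring every sample in the window to be quiet, rather than just the latest one, keeps a brief pause in user activity from being mistaken for a truly idle system.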
We are essentially trying to design machines that look increasingly like the human brain...parallel processing devices with the ability to compute multiple tasks at the same time...this explains the recent fascination with DeepMind and a number of other AI companies.
I am not sure I agree, Zewde...people keep talking about using the brain as an example, and perhaps multiple cores is a small step in this direction...but the basic computation and communication we use in electronics is digital while the brain uses analog...very different approaches...also we insist on perfection (we test all chips so they don't "fail" by trying as many vectors as possible), while the brain is fine with rough calculations and occasional errors...but look at the power dissipation, brain 20W, Watson from IBM 20kW, 1000x difference...Kris
One of the underappreciated features of the human brain is the ability to prioritize. If I ask you a complex math problem and you're attacked by a lion at the same time, you'll ignore my question (or provide a random answer) while you run for safety. Computers will always deliver the "correct" answer to the math problem even if in the process the combined delay proves fatal to the greater objective. I suppose it is one of the advantages of our highly parallel processing system with a sophisticated overarching control system. Every once in a while we find ourselves slightly conflicted - but considering the quantity of processing that we're doing, these events are trivial and quite infrequent. Imagine if once a day we got into gridlock and our hearts, lungs, and brains stopped pending a reboot. We would have been extinct millions of years ago. Perhaps there is a lesson in that for computers.
I suspect Horowitz was pointing to work at parallel research labs at Stanford and Berkeley that concluded that, instead of one big monolithic general processor, we want a host of task-specific blocks to have the best p/r and p/w.
That's what we are already seeing in smartphone chips with their graphics, audio, image, baseband and general processors. So most data movement is on chip. Is the penalty still big?
I may have spent too much time reading between the lines, but it appeared to me that they were promoting a pool of heterogeneous processors where each app would be directed to the processor that best fits its profile. If this is the case, then there are going to be resource conflicts, thus the queueing. The queueing can be internal, but may require a lot of memory if the bottlenecks are significant. In the case of cell phones, etc., the interconnections are very application specific. To make them general purpose, as I surmised from the article, a whole new programming model would be required that includes scheduling and other conflict management functions built into the HW.
If we want an array of homogeneous processors, then the GPUs and Xeon Phi are already doing this - and they are very power hungry.
About the pix, I chose this avatar because I love this Yosemite hike and most people haven't seen Half Dome from this side :-).