In the world of silicon implementation, things change every minute. But in the world of system architecture, the pace is more leisurely. Old ideas have a nasty way of getting reinvented by a new generation of innocents, to the great amusement of industry veterans. A case in point is parallel multiprocessing.
As ideas go, this one is really old. Computer scientists were building large parallel machines with dozens of computing elements in the days of transistor-level integration: Illiac IV, to drop a name that only the superannuated would recall, dates from the mid-1960s. The irony is that silicon architects are now starting to tread the same ground so laboriously (and ingloriously, in Illiac's case) broken by an earlier generation of computer scientists.
Of course we're using different names this time. Several areas, including wire-speed switching and signal processing for wireless applications, are seeing single processors run out of gas. So architects are pasting down multiple instances of their CPU, network processor or DSP core until arithmetic tells them they have enough Gops to meet their requirements. And then comes the problem of interconnect.
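The arithmetic in question is rarely more than a ceiling division. As a minimal sketch, with made-up numbers: the function name, the Gops figures and the `efficiency` derating factor below are illustrative assumptions, not taken from any real core's data sheet.

```python
import math

def cores_needed(required_gops, gops_per_core, efficiency=1.0):
    """Smallest number of core instances whose combined throughput meets the target.

    `efficiency` is a hypothetical derating factor (0 < efficiency <= 1) standing in
    for throughput lost to interconnect and memory stalls; 1.0 is the naive case.
    """
    if gops_per_core <= 0 or not 0 < efficiency <= 1:
        raise ValueError("gops_per_core must be positive and efficiency in (0, 1]")
    if required_gops <= 0:
        return 0
    return math.ceil(required_gops / (gops_per_core * efficiency))

# Hypothetical requirement: 40 Gops from cores rated at 6 Gops each.
naive = cores_needed(40, 6)          # assumes every core runs at its rated speed
derated = cores_needed(40, 6, 0.5)   # assumes half the rated throughput survives
```

The naive count assumes every instance delivers its rated throughput. How far the derated count drifts from it depends on everything the division leaves out.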
One of the first things the supercomputing folks learned 40 years ago was that parallel machines are relatively insensitive to the performance of an individual computing element. That is fortunate, since processors rarely behave as represented in the data sheet. But such architectures are exquisitely sensitive to the topology and performance of the interconnect and memory. In general, if the topology of the processing elements, links and memories isn't nearly congruent to the actual data flow through the system during normal operation, you are in trouble.
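One rough way to put a number on "nearly congruent" is to list the application's data flows and count how many interconnect hops each one takes on a candidate topology. The sketch below does this for a toy four-stage pipeline. The block names, traffic weights and the hop-count metric itself are assumptions chosen for illustration, not an established measure.

```python
from collections import deque

def hop_count(links, src, dst):
    """Fewest links between two blocks (breadth-first search), or None if unreachable."""
    adjacency = {}
    for a, b in links:  # links are treated as bidirectional
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def mismatch_report(flows, links):
    """Return (traffic-weighted mean hops, list of flows that need more than one hop)."""
    total_traffic = sum(flows.values())
    if total_traffic <= 0:
        raise ValueError("flows must carry some traffic")
    weighted_hops = 0
    indirect = []
    for (src, dst), traffic in flows.items():
        hops = hop_count(links, src, dst)
        if hops is None:
            raise ValueError(f"no path from {src} to {dst}")
        weighted_hops += traffic * hops
        if hops > 1:
            indirect.append((src, dst, hops))
    return weighted_hops / total_traffic, indirect

# Hypothetical application data flow: a pipeline, with traffic in arbitrary units.
flows = {("rx", "parse"): 10, ("parse", "classify"): 10, ("classify", "tx"): 10}

# Topology A mirrors the pipeline; topology B hangs every block off one shared hub.
chain = [("rx", "parse"), ("parse", "classify"), ("classify", "tx")]
star = [("rx", "hub"), ("parse", "hub"), ("classify", "hub"), ("tx", "hub")]

chain_mean, chain_indirect = mismatch_report(flows, chain)
star_mean, star_indirect = mismatch_report(flows, star)
```

In this toy case the chain gives every flow a direct link, while the star sends every flow through the hub, doubling the mean hop count. The metric ignores link bandwidth and contention, so it is best read as a first sanity check rather than a performance model.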
This would appear to be a simplification, not a problem. Just analyze the data flow of your application, identify the necessary flows, transforms and storage, and sit down with an architectural-planning tool. How you implement the pieces is a secondary issue.
And it might be that simple except for a human factor. Engineering education outside Europe has, for some reason, studiously avoided teaching most designers to think in terms of data flows. By the time they have learned C (a procedural language) and C++ (C thinly cloaked in a wrapper of object-oriented lingo), engineers are so biased toward control flow that even visualizing moving data seems counterintuitive.
Nonetheless, don't be surprised if the future of high-performance embedded processors involves data flow analysis at the front end, with the results used to generate first the requirements and topology, then timing and power charts, and only then block implementations.