In a meeting at the Embedded Systems Conference recently, someone made a remark that has been sticking in the back of my mind ever since. "You know," this group manager said, "we really haven't taken on the real system-on-chip challenge yet. So far, the SoC designs that have been attempted have been mostly just integrations of existing board-level designs. We haven't really started to ask how we would do the underlying architecture differently with the ability to put all those transistors on one die."
That's an interesting observation. So far, there has been so much struggle just to make an SoC design work, from process problems to model inaccuracies to tool problems to methodology failures, that there hasn't been much discussion about what we are trying to design in the first place. And I think this guy was right: To date, the designs have been mostly glorified microcontrollers. They almost all follow the pattern of a CPU surrounded by a cluster of small blocks of memory and more or less intelligent peripheral controllers. If you will, 8051s on steroids.
But what if we pulled out a clean sheet and started with a new architecture rather than an existing design? Would we do things differently?
The example that is emerging in the so-called network processor race suggests that the answer is a definite "yes." Network processors have followed the outlines of the multiboard designs they are attempting to replace: a large number of specialized packet engines running at line speed, clustered around a central higher-level CPU and linked by a more or less flexible interconnect matrix.
But in following this model, the network processors have left behind the old microcontroller paradigm and have started moving toward the concept of parallel processing. If you put your finger over the CPU block, the data sheet diagram for one of these things looks more like the system diagram for ILLIAC IV than like an overblown 8051.
The trend could go further. Suppose that you had a communications problem that needed very high bandwidth but was not especially sensitive to latency. You could license (or write for yourself) an 8051-type MCU core and instantiate it, say, 64 or 96 times. Then you could connect the whole array of simple processors to the incoming data stream with a partially populated crossbar switch, and put the whole thing under the control of a scheduling task on the central CPU core.
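To make the idea concrete, here is a rough behavioral sketch in Python of what that scheduling task might look like. Everything in it is an assumption for illustration, not a description of any real part: the port count, the fanout of the crossbar, the fixed service time per packet, and the function names `build_crossbar` and `schedule` are all made up. It models only the dispatch decision, where each input port can reach a limited subset of cores and the scheduler hands each packet to whichever reachable core frees up first.

```python
# Behavioral sketch only; all sizes and names are illustrative assumptions.

def build_crossbar(num_ports, num_cores, fanout):
    """Partially populated crossbar: each port reaches `fanout` consecutive
    cores (wrapping around), rather than all of them."""
    stride = num_cores // num_ports
    return {
        port: [(port * stride + k) % num_cores for k in range(fanout)]
        for port in range(num_ports)
    }

def schedule(packets, crossbar, num_cores, service_time):
    """Assign each packet to the reachable core that becomes free earliest.

    packets: list of (arrival_cycle, port), in arrival order.
    Returns a list of (core, start_cycle), one entry per packet.
    """
    free_at = [0] * num_cores  # cycle at which each core is next idle
    assignments = []
    for arrival, port in packets:
        core = min(crossbar[port], key=lambda c: free_at[c])
        start = max(arrival, free_at[core])  # wait if the core is still busy
        free_at[core] = start + service_time
        assignments.append((core, start))
    return assignments

if __name__ == "__main__":
    # 64 simple cores, 8 input ports, each port wired to 16 cores (assumed).
    xbar = build_crossbar(num_ports=8, num_cores=64, fanout=16)
    burst = [(0, 0), (0, 0), (0, 0), (5, 3)]
    for (arrival, port), (core, start) in zip(burst, schedule(burst, xbar, 64, 100)):
        print(f"port {port} @ cycle {arrival} -> core {core}, starts at cycle {start}")
```

The sketch also shows where the latency goes. Shrink the fanout to 2 and a burst of three packets on one port finds only two cores to land on, so the third packet waits a full service time. That is the trade you accept for not building the full crossbar, and it is tolerable only because the problem was assumed to be bandwidth-bound rather than latency-bound.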
Similar approaches can be applied to any dataflow that lends itself to partitioning. This gets interesting in analog design, but that can be a different subject.