The latest rage in architectures is, following Sutton's Law, network processors. (In case you hadn't previously encountered it, Sutton's Law is based on a probably fabricated interview with the eponymous U.S. bank robber. When asked why he robbed banks, he allegedly replied "That's where the money is.")
For a number of reasons, a network processor is a tricky design. Everyone wants to run at wire speed, in which "wire" can mean anything from 10-Mbit/s Ethernet to OC-48 optical fiber. Complicating things, packet processing involves tearing apart packets in which just about nothing is aligned on a byte boundary, making very fast decisions based on table lookups, and executing encryption algorithms on the fly.
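To make the alignment problem concrete, here is a minimal sketch in Python of pulling fields out of a packet header. It uses the IPv4 header as the example, since its field widths (4, 4, 6, 2, 16, 16, 3 and 13 bits) are well known. The helper name extract_bits and the sample header bytes are my own inventions for illustration, not part of any network processor's tool set.

```python
def extract_bits(buf: bytes, bit_offset: int, width: int) -> int:
    """Return `width` bits starting `bit_offset` bits into `buf` (big-endian)."""
    first = bit_offset // 8                      # first byte touched
    last = (bit_offset + width - 1) // 8         # last byte touched
    chunk = int.from_bytes(buf[first:last + 1], "big")
    total_bits = (last - first + 1) * 8
    shift = total_bits - (bit_offset % 8) - width  # bits to drop on the right
    return (chunk >> shift) & ((1 << width) - 1)


# Hypothetical first 8 bytes of an IPv4 header, made up for this example.
header = bytes([0x45, 0xB8, 0x00, 0x54, 0x00, 0x00, 0x20, 0xB9])

version = extract_bits(header, 0, 4)        # 4-bit field
ihl = extract_bits(header, 4, 4)            # 4-bit field
dscp = extract_bits(header, 8, 6)           # 6-bit field
flags = extract_bits(header, 48, 3)         # 3-bit field
frag_offset = extract_bits(header, 51, 13)  # 13 bits, straddling two bytes
```

Every one of those fields costs a load, a shift and a mask on a general-purpose machine, and that is before any table lookup or encryption has been done.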
Not a good match for a conventional RISC engine.
Nonetheless, we are starting to see proposed solutions now, and the approaches seem similar: a cloud of specialized packet processors, usually surrounding a 32-bit RISC master processor.
The thinking behind these approaches appears to be that the specialized packet engines may be tricky to program, but they will pretty much all be running instances of the same small code set, and doing basically easy tasks.
The really complex stuff, like Quality of Service decisions and higher-layer protocol processing, will be on the RISC engine where it can be tamed by powerful language tools.
There is a little something missing from this analysis, though, I think.
Historically, multiprocessing systems have been the hardest architectures to program and, especially, to debug.
Tightly coupled multiprocessing systems have been the most difficult of the multiprocessors.
And tightly coupled heterogeneous systems, in which there are at least two different kinds of processors, have been the hardest of all.
It's not that the theory is hard. But the practice can be a bear. Finding bugs, particularly when the bug involves interaction between two or more of the processors, can be a nightmare. Not a few design teams have had to resort to multiple in-circuit emulators and logic analyzers lashed together.
Now, given that history, it seems fairly obvious that the less observable such a system is, the more trouble it's going to be to debug.
So why not put all the processors on a single chip, where none of them can be examined in real time, and be done with it? I'd want to ask some questions about debug strategy before I signed anything.