San Mateo, Calif. - Single-chip systems based on a large array of CPU cores are quietly appearing in an increasing number of applications. Design teams are finding that these architectures, though far from conventional, may be the least risky of their alternatives. A network processor design recently announced by Cisco Systems Inc. based on an array of 192 processor cores from Tensilica Inc. (Santa Clara, Calif.) illustrates the trend. Companies like ARC and picoChip have also disclosed real-world processor-array-on-chip architectures.
Cisco's CRS-1 chip had an intimidating charter from the outset, said Dan Lenoski, vice president of engineering for the routing-technology group at Cisco Systems (San Jose, Calif.). "We were creating a routing-platform architecture that would have a 10-year life span," Lenoski said. Given the rate of change taking place in communications protocols, "we felt that meant that the routing engine had to be programmable-not just in theory, but with available, general-purpose programming tools. And we had a 40-Gbit/second rate target."
The key element of the new platform would be a routing chip that would take in packets at the 40-Gbit line rate and perform header decapsulation/encapsulation, forwarding-table lookups, classification, accounting functions and interaction with external queuing chips.
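In rough outline, that per-packet work amounts to a run-to-completion sequence. The C sketch below is illustrative only; every type and function name in it is hypothetical, not Cisco's actual code:

    /* Illustrative run-to-completion view of the per-packet work
       described in the article; all names here are hypothetical. */
    #include <stdint.h>
    #include <stddef.h>

    struct packet { uint8_t data[1600]; size_t len; uint32_t flow_class; };

    static void decapsulate(struct packet *p)          { (void)p; /* strip outer header */ }
    static uint32_t fib_lookup(const struct packet *p) { (void)p; return 0; /* next hop */ }
    static void classify(struct packet *p)             { p->flow_class = 0; }
    static void account(const struct packet *p, uint32_t nh)   { (void)p; (void)nh; /* counters */ }
    static void encapsulate(struct packet *p, uint32_t nh)     { (void)p; (void)nh; /* new header */ }

    /* One CPU does the whole job on one packet. */
    void process_packet(struct packet *p)
    {
        decapsulate(p);                     /* header decapsulation     */
        uint32_t next_hop = fib_lookup(p);  /* forwarding-table lookup  */
        classify(p);                        /* classification           */
        account(p, next_hop);               /* accounting               */
        encapsulate(p, next_hop);           /* header encapsulation     */
        /* hand-off to the external queuing chips would follow here */
    }

The point of the structure is that nothing in it assumes a particular protocol: changing the forwarding behavior means recompiling C code, not respinning silicon.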
The Cisco team rapidly concluded that no existing network processor was up to the task. Nor was it viable to create a unique architecture based on a proprietary NPU-on-steroids, a reconfigurable computing engine built from programmable logic or anything similarly speculative. Even if such architectures could reach the desired performance, the need to keep them programmable for the long years the equipment would stay in the field ruled them out.
By the process of elimination, the team considered an array of standard CPUs executing standard C-based code. That still left open the question of how to organize the CPUs and how to map tasks to them.
A pipelined architecture, in which each packet passes through a chain of CPUs that each perform one stage of the processing, was perhaps the most obvious approach to achieving high speed while minimizing instruction-memory requirements: each CPU would need to hold only the code for its own stage. But, Lenoski pointed out, such an approach severely limits flexibility for the future. Every decision that moves the chip design away from a general array and toward higher efficiency builds assumptions about the routing algorithms into the silicon.
So the team decided that each packet would be fed into a small cluster of identical processors, which would handle all the computations on a packet and then pass it on. The number of clusters necessary would be simple arithmetic from the line rate and the estimated worst-case processing time. This would necessitate having the same code repeated in the instruction caches of numerous CPUs, and it would require much more generality in the interconnect scheme than would have been necessary for a pipelined design. "Given our need for flexibility, replicating the instruction memory was not too painful," Lenoski said. "We ended up with a design in which a packet flows into a cluster of 12 CPUs. The packets get distributed within the cluster, and one CPU ends up doing the whole job on one packet. The cluster has its own shared local memory. There is a cascaded structure of crossbar switches interconnecting the processors to permit packets to move in and out, and to give all the processors access to shared on-chip resources."
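The sizing arithmetic Lenoski alludes to is essentially Little's law: the number of busy CPUs equals the packet arrival rate times the per-packet service time. In the sketch below, only the 40-Gbit/s line rate and the 12-CPU cluster size come from the article; the minimum packet size and the worst-case service time are assumptions chosen for illustration, since Cisco's per-packet cycle budget is not disclosed:

    /* Back-of-the-envelope cluster sizing. Only the 40-Gbit/s rate and
       the 12-CPU cluster come from the article; the packet size and
       service time below are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double line_rate_bps  = 40e9;    /* 40-Gbit/s target (from article)     */
        double min_pkt_bits   = 40 * 8;  /* assumed worst case: 40-byte packets */
        double pkts_per_sec   = line_rate_bps / min_pkt_bits;

        double service_time_s = 1.5e-6;  /* assumed worst case per packet
                                            (~375 cycles at 250 MHz)            */
        double cpus_needed    = pkts_per_sec * service_time_s;  /* Little's law */

        int cpus_per_cluster  = 12;      /* from the article */
        int clusters = (int)((cpus_needed + cpus_per_cluster - 1)
                             / cpus_per_cluster);

        printf("%.0f packets/s -> %.1f CPUs -> %d clusters\n",
               pkts_per_sec, cpus_needed, clusters);
        return 0;
    }

With these assumed figures the arithmetic lands at 125 million packets/s and 187.5 busy CPUs, or 16 clusters of 12, which is consistent with (though not confirmation of) the array dimensions described below.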
Each processor in the cluster, Lenoski said, has its own "very modest" caches, and there is a shared L2. There are, altogether, 16 clusters of CPUs on the die. In addition, one more CPU is specialized for debug, and another as a system interface and maintenance processor. A set of shared resources exists for functions such as table lookups.
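Rendered as a data structure, the hierarchy looks roughly like the hypothetical C below. Only the counts quoted above come from the article; every name, size and placement detail is an assumption:

    /* Hypothetical rendering of the topology described above; the
       article does not spell out sizes or the exact sharing scope. */
    struct cpu { int id; /* plus its own "very modest" private caches */ };
    struct mem { unsigned bytes; /* sizes not disclosed */ };

    struct cluster {
        struct cpu cores[12];   /* one packet handled start-to-finish
                                   by one of these                     */
        struct mem shared_l2;   /* shared L2 (sharing scope assumed)   */
        struct mem local_mem;   /* cluster's shared local memory       */
    };

    struct routing_chip {
        struct cluster clusters[16];  /* 192 CPU sites in all          */
        struct cpu debug_cpu;         /* specialized for debug         */
        struct cpu maint_cpu;         /* system interface/maintenance  */
        /* cascaded crossbars and shared lookup engines not modeled */
    };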
The design makes some, but not extensive, use of Tensilica's instruction-set configurability. "We use the cores almost in their off-the-shelf configuration," Lenoski said. "We added some extensions for nonaligned operations, because extracting and inserting bit fields in the header is always a time-critical operation without unaligned data handling. And there were a few other changes. But probably the most significant piece of hardware we added was a DMA [direct memory access] engine of our own design. That is responsible for the majority of the data traffic around the chip."
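The unaligned-operation extension matters because header fields rarely sit on convenient byte or word boundaries. The portable C below shows what a core without such an instruction has to do for every field; a funnel-shift or unaligned-extract instruction collapses the loop and shifts into one or two operations. The function is illustrative, not Cisco's actual extension:

    /* Generic bit-field extraction from a network-order header:
       the slow, portable fallback that a hardware extract avoids.
       Caller must guarantee at least 5 readable bytes at hdr[bit_off/8]. */
    #include <stdint.h>
    #include <stddef.h>

    /* Extract 'width' bits (width <= 32) starting 'bit_off' bits into hdr. */
    static uint32_t extract_bits(const uint8_t *hdr, size_t bit_off,
                                 unsigned width)
    {
        uint64_t window = 0;
        size_t byte = bit_off / 8;

        /* Gather 5 bytes: enough for any 32-bit field at any bit offset. */
        for (int i = 0; i < 5; i++)
            window = (window << 8) | hdr[byte + i];

        unsigned shift = 40 - (unsigned)(bit_off % 8) - width;
        return (uint32_t)((window >> shift) & ((1ull << width) - 1));
    }

For example, extract_bits(ip_hdr, 51, 13) would pull the 13-bit fragment-offset field out of an IPv4 header, a field that starts three bits into byte 6 and spans a byte boundary.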
The chip is being fabricated in 130-nanometer CMOS at IBM Corp. It has a total of 192 CPU sites in the array. The design permits up to four processors to be disabled, guaranteeing a minimum of 188 active routing CPUs. On top of that are the two other processors and the shared resources. Lenoski estimated that altogether the chip, which runs at 250 MHz, has about three times the processing power of the fastest Pentium.
The Cisco design is not the only array architecture in actual use. Although less focused on large arrays of processors than Tensilica, rival ARC International (San Jose) is also seeing gradual acceptance of array architectures. "I can think of maybe a dozen examples of designs we've been involved in, in which more than six cores were used," said architectures manager Peter Wells. "Some of them get up into the 64-core range. We were involved in one paper design with 264 cores, but it was an attempt to see how far the envelope could be pushed, and I don't believe it was ever implemented."
Most often, Wells said, architectures with large numbers of cores are homogeneous-that is, they use many instances of the same core rather than a complex arrangement of cores with different configurations. Design differentiation, and adaptation to a particular set of tasks, tend to come from the choice of caching schemes, memory architectures and especially interconnect architectures. "When it comes to interconnect, it seems like everybody is different," Wells said. "Often that's a differentiation for the design team. You see everything from wide buses to crossbars to point-to-point connections."
But often, Wells said, the main reason for moving to the processor-array approach is software. With a large array of similar processors, application-level programming becomes very similar to programming a single CPU. The language is usually C or C++. Except for some communications routines that handle data flowing between processors, and for the supervisory tasks-which usually run on a separate processor outside the array-the programming model is in fact the same as it would be for a single CPU. This enormously simplifies the problems of assembling a software team, of debugging and of long-term maintenance.
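In practice, the model Wells describes reduces each core's code to an ordinary C event loop; only the dispatch and hand-off routines, whose names below are hypothetical, know that other processors exist:

    /* Sketch of the single-CPU programming model: the application code
       is plain C, and inter-processor traffic hides behind the dispatch
       routines. All names are hypothetical. */
    struct packet;

    extern struct packet *dispatch_receive(void);  /* blocks for next packet  */
    extern void dispatch_forward(struct packet *); /* hands result onward     */
    extern void process_packet(struct packet *);   /* ordinary single-CPU C   */

    void worker_main(void)
    {
        for (;;) {
            struct packet *p = dispatch_receive();
            process_packet(p);   /* nothing here knows about the array */
            dispatch_forward(p);
        }
    }

In principle, the same process_packet() could be debugged on a single workstation CPU before being replicated across the array, which is much of what simplifies team-building and maintenance.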
Meanwhile, at least one standard-product IC vendor, picoChip Designs Ltd. (Bath, England), reports increased receptivity to its processor-array architecture. In picoChip's case, the architecture is mildly heterogeneous and somewhat specialized for the needs of cellular base-station processing.
Vice president of marketing Rupert Baines said the specter of WiMAX wireless broadband, and the cloud of uncertainty surrounding the 802.16e wireless metro network standard, have boosted interest. "We now have two solid design wins in the WiMAX area-systems under development with scheduled announcement dates," Baines said.