An easy conclusion from this year’s Hot Chips conference is that multicore is becoming many-core. While the PC and server markets gradually evolve from four to six or eight massive x86 cores, Hot Chips papers suggest that the rest of the world is moving in a different direction: large numbers of relatively simple CPUs. But the trend reinforces a long-appreciated set of questions. As the number of cores grows, how do you deal scalably with interconnect, memory hierarchy, coherency, and inter-thread synchronization?
Answers to these questions depend on the size of the design, the application space, and the heritage of the design team. Solutions at Hot Chips ranged from the elegantly—and perhaps overly—simple to the rococo.
At the large end of the spectrum was Cavium, describing the 32-core Octeon 68xx family of network-processing ICs (figure 1). The family claims its place in the many-core trend by using up to 32 identical MIPS64 cores. The individual cores are relatively simple dual-issue, in-order designs with some networking-specific extensions, according to Cavium fellow Richard Kessler.
Figure 1. The Octeon 68xx can scale up to 32 CPUs.
Each MIPS core has its own highly associative L1 caches. Observing that packet-processing applications tend to be low-touch, Kessler said that the best policy for the L1 was write-through, backed by a substantial write buffer. Just in case, there is also the ability to declare pages private and suppress the write-through when desired.
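In rough outline, the software view would be something like the sketch below. The mark_page_private() call is an invented stand-in for whatever page-attribute mechanism Cavium’s kernel actually exposes; the point is simply which data stays on the write-through path and which does not.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical stand-in for a kernel call that would set the "private"
   page attribute described above; stubbed so the sketch compiles. */
static int mark_page_private(void *base, size_t len)
{
    (void)base;
    (void)len;
    return 0;
}

/* Per-thread scratch area, page-aligned so the whole page can be private. */
static uint8_t scratch[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));

void thread_init(void)
{
    /* Scratch data is written constantly; suppressing write-through for its
       page keeps that traffic off the shared path to L2. */
    mark_page_private(scratch, sizeof scratch);
}

uint32_t process_packet(const uint8_t *pkt, size_t len)
{
    /* Low-touch path: a few loads from the packet, a handful of stores.
       Each store is written through toward L2, but the write buffer hides
       the latency because the stores are infrequent. */
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += pkt[i];

    scratch[0] = (uint8_t)sum;   /* lands in the private, write-back page */
    return sum;
}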
The 32 cores are grouped into four clusters, each with its own connection to a crossbar switch (figure 2). On the other side of the crossbar are four L2 controllers, each with access to a pair of DDR3 DRAM controllers and to 4 Mbytes of shared L2 cache. The L2 controllers serve as the point of coherency for this hierarchy.
Figure 2. CPUs are clustered around a central crossbar.
The family reflects its network-processing heritage with a curious asymmetric, heterogeneous architecture. Along with the clusters of MIPS cores, the architecture includes a number of dedicated accelerators, including a compression/decompression engine, a RAID 5/6 processor, and an elaborate regular-expression processor for deep packet inspection. These engines attach to the crossbar via a shared I/O bus and a bus bridge, apparently outside the coherency sphere.
This arrangement suggests Cavium’s view of how software will employ the chip. To a MIPS thread, the chip appears to be a vast symmetric multiprocessing system with strictly coherent memory and dynamic scheduling: a hardware scheduling engine on the I/O bus can trigger tasks when a packet arrives and can release locks. The design has a strong bias toward short packets and relatively simple, independent threads. Kessler said that to speed processing of short packets, the design team put the crypto acceleration hardware not in an outside engine but in the MIPS cores themselves, in parallel with the main execution pipeline.
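Putting those pieces together suggests a run-to-completion programming model along the lines of the sketch below. The names are invented for illustration, not taken from Cavium’s SDK.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *data;
    size_t   len;
    uint32_t tag;    /* lets the scheduler order or serialize a flow */
} packet_t;

/* Stand-ins for the hardware scheduling engine and the in-core crypto
   unit; these names are assumptions made for this sketch. */
extern bool sched_get_work(packet_t *out);       /* blocks until a packet arrives */
extern void sched_submit(const packet_t *pkt);   /* hands the packet downstream   */
extern void decrypt_inline(packet_t *pkt);       /* crypto unit inside the core   */

/* Every core runs the same loop; the hardware scheduler, not software,
   decides which core receives which packet, so the chip looks to each
   thread like one large, coherent SMP machine. */
void worker_loop(void)
{
    packet_t pkt;

    while (sched_get_work(&pkt)) {
        decrypt_inline(&pkt);            /* short packets never leave the core */
        /* ...protocol processing against shared, coherent memory... */
        sched_submit(&pkt);
    }
}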
Cavium treats compression, deep packet inspection, and RAID processing quite differently, connecting these engines in a way that suggests setting up large streams of packets, bursting the streams out to an engine, and suspending the thread until the packets are processed.
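For those heavier engines the implied model looks more like the following sketch, again with invented names standing in for the real descriptor or work-queue interface.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *data;
    size_t         len;
} buf_t;

typedef int dpi_job_t;   /* handle for a submitted inspection job */

/* Invented interface for one of the heavyweight engines: queue a whole
   burst of packets, then park the thread until the engine is done. */
extern dpi_job_t dpi_submit_burst(const buf_t *bufs, size_t count);
extern void      dpi_wait(dpi_job_t job);    /* suspends the calling thread */

void inspect_stream(const buf_t *bufs, size_t count)
{
    /* One submission covers the whole stream, amortizing the trip across
       the I/O bus and bus bridge over many packets. */
    dpi_job_t job = dpi_submit_burst(bufs, count);

    dpi_wait(job);   /* thread sleeps; the scheduler can run other work */
}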
It also appears Cavium relies primarily on locks for synchronization, assisted by facilities in the scheduler hardware. Given the numerous issues with locks, exactly how they interact with the write buffers and the coherency architecture could be a very important detail.
IBM followed Cavium’s presentation with an entirely different kind of chip, yet one with some striking similarities to the Octeon. The Blue Gene/Q chip is the basic processing element for IBM’s next massively parallel scientific computer. Accordingly, it emphasizes heavy threads, compute-intensive inner loops, and floating-point performance. Like Octeon, the Blue Gene/Q employs a large number of relatively simple cores, although the cores in this case are rather large Power cores with big floating-point units. The architectural emphasis, again, is on interconnecting and synchronizing the cores more than on mining the last gram of instruction-level parallelism from them.
At the macro level, similarities quickly give way to differences. The IBM chip employs 18 Power Architecture cores closely related to the cores in the PowerEN microprocessor. Each core has its own quad-pipeline FPU and L1 caches. Sixteen of the cores are used for computing, one for control, and one is redundant, with a mechanism for switching it into the array if another core fails. Rather than replicate more than 18 of these large cores, IBM chose to give each core hardware support for four concurrent threads, so under ideal circumstances the chip can behave almost like a 64-CPU system. To minimize stalls from cache misses, each core has an adaptive prefetch engine that serves all four of the active threads. The engine can either prefetch in stream mode or, if it is tipped off that the processor is entering an inner loop, record the sequence of fetches and replay it from a list.
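The list mode is easiest to picture as a software model. The sketch below is purely illustrative, not IBM’s implementation: it records the addresses that miss on the first trip through a loop and replays them as prefetches on later trips.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LIST_MAX 1024

/* Toy model of a list prefetcher: not IBM's hardware, just the idea. */
typedef struct {
    uintptr_t list[LIST_MAX];
    size_t    len;        /* how many miss addresses were recorded */
    size_t    replay;     /* position while replaying the list     */
    bool      recording;  /* set when software flags an inner loop */
} list_prefetcher;

void lp_enter_loop(list_prefetcher *lp)
{
    lp->len = 0;
    lp->replay = 0;
    lp->recording = true;   /* first trip through the loop: record */
}

/* Conceptually called on every cache miss while recording. */
void lp_on_miss(list_prefetcher *lp, uintptr_t addr)
{
    if (lp->recording && lp->len < LIST_MAX)
        lp->list[lp->len++] = addr;
}

/* On later trips, issue the next few recorded addresses as prefetches,
   well ahead of the demand accesses that will need them. */
void lp_replay(list_prefetcher *lp, void (*prefetch)(uintptr_t))
{
    for (int ahead = 0; ahead < 4 && lp->replay < lp->len; ahead++)
        prefetch(lp->list[lp->replay++]);
}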
But it is in the interconnect between the CPUs that the Blue Gene/Q really begins to diverge from the Octeon (figure 3). There are superficial similarities: Blue Gene/Q has a massive shared L2 that provides the point of coherency for the memory system. And there is a large central crossbar switch, connecting the CPUs, the L2, a pair of wide DDR3 controllers, PCIe, and a router for chip-to-chip networking.
Figure 3. IBM's Blue Gene/Q employs 18 four-thread Power cores.
The L2, however, is quite different from the design in the Cavium chip. IBM built it as 32 Mbytes of embedded DRAM in 16 slices. The cache supports multiple versions of cached data, tracked by tags and a scoreboard, as well as atomic operations. These facilities make it very different from a conventional set-associative cache in two important ways.
First, the cache can act as a transactional memory—that is, a thread on a CPU can perform an arbitrarily long series of operations with the cache as a single atomic transaction. This, according to IBM Blue Gene chip design manager Ruud Haring, eliminates the need for locks for inter-thread synchronization. Apparently the cache creates a new version for the transaction, and then watches to see if any load/store violations occur during the transaction. If any are spotted, the cache flags the software, which must then resolve the conflict, potentially invalidating the version of the transaction.
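From the programmer’s side, the pattern would look something like the sketch below, with tm_begin() and tm_end() as invented stand-ins for whatever hooks IBM’s toolchain provides.

#include <stdint.h>

/* Invented stand-ins for transactional-memory hooks; the real ones would
   come from IBM's compiler and runtime. Assume tm_begin() returns 0 when a
   transaction starts, and returns nonzero when the L2 has detected a
   conflicting load or store and discarded the transaction's version. */
extern int  tm_begin(void);
extern void tm_end(void);

extern int64_t histogram[256];

/* Many threads can call this concurrently with no lock: each update runs
   as one atomic transaction against the versioned L2. */
void add_sample(uint8_t bucket, int64_t weight)
{
    for (;;) {
        if (tm_begin() == 0) {      /* transaction started cleanly    */
            histogram[bucket] += weight;
            tm_end();               /* version commits if no conflict */
            return;
        }
        /* Nonzero return: a conflict invalidated our version, and software
           must resolve it; here the resolution is simply to retry. */
    }
}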
Second, the version capability allows deep speculative execution. A thread may execute past a control point or a data dependency, making an assumption about the correct direction of the branch or value of the data. The cache keeps the speculative operations in a version. If the assumptions later prove false, the L2 can invalidate the version and alert the software that the sequence must be rerun. Here we are seeing a more general-purpose processing system with quite advanced thinking about synchronization and speculation, but—and this is a departure from the CPU-centric thinking of the past—with the effort going into the memory architecture rather than the CPU core.
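A toy, software-only model conveys the idea of the versioning, though the real mechanism lives in the eDRAM L2 and is far more elaborate: speculative stores land in a shadow copy that is either promoted or discarded.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LINES  16
#define LINESZ 64

/* Toy model of a versioned cache: a speculative version shadows the
   committed data until the speculation is resolved. */
typedef struct {
    unsigned char committed[LINES][LINESZ];
    unsigned char spec[LINES][LINESZ];
    bool          dirty[LINES];
} versioned_cache;

/* Speculative store: write the shadow copy, leave committed data alone. */
void spec_store(versioned_cache *c, size_t line, size_t off, unsigned char v)
{
    if (!c->dirty[line]) {
        memcpy(c->spec[line], c->committed[line], LINESZ);
        c->dirty[line] = true;
    }
    c->spec[line][off] = v;
}

/* The assumption proved true: promote the version to architectural state. */
void spec_commit(versioned_cache *c)
{
    for (size_t i = 0; i < LINES; i++)
        if (c->dirty[i]) {
            memcpy(c->committed[i], c->spec[i], LINESZ);
            c->dirty[i] = false;
        }
}

/* The assumption proved false: drop the version; software reruns the code. */
void spec_abort(versioned_cache *c)
{
    memset(c->dirty, 0, sizeof c->dirty);
}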
Our third study is a many-core chip built from yet another point of view, the Chinese Academy of Sciences’ Godson-T. Following earlier Godson chips that focused on SIMD approaches to extracting data parallelism, Godson-T, like Blue Gene/Q, seeks to exploit thread-level parallelism by providing many simple CPU cores, up to 64, on a die. But unlike Octeon, which will mainly be programmed by Cavium’s internal team of experts, and Blue Gene/Q, which can count on having the finest scientific programmers, Godson-T is aimed at applications coded by teams less experienced with multiprocessing.
At a superficial level we see the familiar pattern in Godson-T: an array of relatively simple MIPS-derived cores, each with its own cache, local memory, and connections to the rest of the die, all unified by a large L2. But the nature of the connections appears influenced by the designers’ wish for both silicon and programming simplicity.
To begin with, there is no complex, custom-designed central crossbar. Godson-T’s cores are arranged in a two-dimensional array with vertical and horizontal connections between elements. Each core includes an internal router that permits low-latency routing to neighbors and wormhole routing across the die. There are actually two physically independent networks, permitting low latency even when some cores are doing high-bandwidth DMA bursts.
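The presentation did not detail the routing algorithm, but dimension-ordered routing, the usual choice for such meshes, shows how little logic each node’s router needs. The sketch below illustrates that generic scheme, not Godson-T’s actual router.

#include <stdio.h>

/* Dimension-ordered (X-then-Y) routing on a 2-D mesh: a common scheme for
   on-chip networks like this one, shown only as an illustration. */
typedef enum { LOCAL, EAST, WEST, NORTH, SOUTH } port;

port next_hop(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return EAST;    /* correct X first...            */
    if (cur_x > dst_x) return WEST;
    if (cur_y < dst_y) return NORTH;   /* ...then Y                     */
    if (cur_y > dst_y) return SOUTH;
    return LOCAL;                      /* arrived: deliver to this core */
}

int main(void)
{
    int x = 0, y = 0;                  /* route from core (0,0) to (5,3) */
    port p;

    while ((p = next_hop(x, y, 5, 3)) != LOCAL) {
        if (p == EAST)       x++;
        else if (p == WEST)  x--;
        else if (p == NORTH) y++;
        else                 y--;
        printf("hop to (%d,%d)\n", x, y);
    }
    return 0;
}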
There is a chip-wide coherency scheme based not on conventional bus snooping or directory structures, but on a mutual-exclusion lock instruction and on an additional hardware block that watches the bus and detects deadlocks. There is also, in each core, a Data Transfer Agent—essentially a super DMA controller—that accelerates movement of complicated data structures over the on-chip networks. The hope, according to presenter Dongrui Fan, is that these structures will speed implementation of thread-rich codes on the chip.
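One way to picture what the Data Transfer Agent offloads is gathering a two-dimensional tile out of a large shared matrix into a core’s local memory. Written as the equivalent CPU loop, with a descriptor layout invented here for illustration, it looks like the sketch below; on Godson-T the idea would be to hand such a descriptor to the DTA and let it stream the data over the on-chip networks while the thread keeps computing.

#include <stddef.h>
#include <string.h>

/* Invented descriptor for a strided 2-D gather; the real DTA interface
   was not described in the talk. */
typedef struct {
    const float *src;         /* start of the tile in shared memory       */
    float       *dst;         /* destination in the core's local memory   */
    size_t       rows;        /* tile height                              */
    size_t       cols;        /* tile width, in elements                  */
    size_t       src_stride;  /* row pitch of the big matrix, in elements */
} dta_desc;

/* What the DTA would do in hardware, written as the equivalent CPU loop. */
void gather_tile(const dta_desc *d)
{
    for (size_t r = 0; r < d->rows; r++)
        memcpy(d->dst + r * d->cols,
               d->src + r * d->src_stride,
               d->cols * sizeof(float));
}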
We have seen three instances of many-core chips, each using relatively simple CPU cores and a unifying L2. But the three differ significantly in the way they move data, support coherency, and provide synchronization between threads. These differences seem driven both by the intended application space and by how the architects view the programmers they expect to be using the silicon. This pattern of similarities and differences may be a good leading indicator of where the effort, and the differentiation, will occur in the next generation of many-core designs.