PORTLAND, Ore. -- On-chip buses and ring topologies in use today will be more trouble than they
are worth, making on-chip mesh networks the preferred architecture for
massively parallel processors, according to researchers at the Massachusetts Institute of Technology (MIT).
"Future multi-core processors will have to communicate the same way
computers hooked to the Internet do--by bundling the information they
transmit into 'packets'. Each core will have its own router, which sends
a packet down any of several paths, depending on the condition of the
network as a whole. In short, rings scale better than buses,
but worse than meshes. Rings cannot scale much beyond 16," according to Li-Shiuan Peh, an associate
professor of electrical engineering and computer science at MIT.
An on-chip mesh network "lays a grid over all the cores, so there are many possible paths between nodes," said Peh. "Latency is much lower, with the disparity increasing as you scale up the core counts. Bandwidth is also much higher because there are many possible paths to spread traffic across."
Intel used an on-chip mesh network with integrated routers for its experimental 80-core TeraFLOPS processor a few years ago, but the most sophisticated on-chip network on any production processor is the ring network on its latest eight-core Xeon E5-2600. Dropping back to a ring topology, however, is just a stop-gap, according to Peh, who claims his recent study shows that at 16 cores or above, Intel, IBM, ARM, Freescale, Samsung and every other multi-core processor maker will have to go to an on-chip mesh network with integrated routers.
Today nearly all multi-core processors use a conventional bus architecture, which overlays a bus above the cores to connect them to each other and to memory. But the bus's last stop will be above quad-cores, according to Peh, which has already prompted some chip makers to move to dual buses, and Intel to adopt a ring topology for its Xeon E5-2600. Above 16 cores, however, all manufacturers will have to adopt the Internet-on-a-chip topology.
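The scaling argument can be made concrete with a back-of-the-envelope hop-count comparison. The sketch below is a hypothetical model (not from Peh's paper): a bidirectional ring needs up to N/2 hops in the worst case, while a square mesh's corner-to-corner Manhattan distance grows only with the square root of the core count (a bus, by contrast, has no hops at all but shares one medium among every core, which is why it saturates first).

```python
import math

# Illustrative worst-case hop counts for a bidirectional ring vs. a square
# mesh. Back-of-the-envelope model only; real routers add per-hop pipeline
# delays and arbitration that this ignores.
def max_hops(topology: str, n: int) -> int:
    if topology == "ring":
        return n // 2                # worst case: halfway around the ring
    if topology == "mesh":
        side = math.isqrt(n)         # assumes n is a perfect square
        return 2 * (side - 1)        # corner to opposite corner (Manhattan)
    raise ValueError(topology)

for n in (16, 64, 256):
    print(f"{n} cores: ring {max_hops('ring', n)} hops, "
          f"mesh {max_hops('mesh', n)} hops")
```

At 16 cores the two are close (8 hops vs. 6), but by 256 cores the ring's worst case is 128 hops against the mesh's 30, which is the disparity Peh describes.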
Peh will present the results at the Design Automation Conference in June with fellow professor Anantha Chandrakasan and doctoral candidate Sunghyun Park. Their demonstration chip showed that a packet-switched Internet-on-a-chip topology using voltage swings of just 300 millivolts consumes 38 percent less energy.
I am still not convinced by NoC, even though it has been a hot research topic for almost 10 years. It might be useful in high-performance computing, but for the embedded domain I do not see the point.
The landscape is a bit more complex than this bus-ring-mesh picture. Freescale ships SoCs with 8 cores based on a switched-fabric interconnect. Cavium ships products with 32 cores based on a combination of bus and switched-fabric interconnect as well. Netlogic uses a hierarchy of rings to deal with the different I/O requirements of its cores/memories/accelerators.
A mesh adds another layer of complexity to the already hard-to-tackle problem of fully and efficiently exploiting the compute capabilities of multicore SoCs: latency differs significantly depending on where your task is scheduled to run.
Tilera has yet to gain significant traction in the market, and I would bet that you will see a lot of products with more than 16 cores shipping with something other than a mesh as the interconnect.
P.S. Cavium's Octeon 3 will ship with 48 cores. Not using a mesh.
For applications that require off-chip access, such as to DRAM, bandwidth requirements will vary with distance from the interface. Applications with heterogeneous processing functions lack symmetry and therefore require a NoC with an irregular topology, just like the Internet.
Peh claims that his mesh implementation does scale up well. Here is what he said to me: "Our paper is challenging this conventional wisdom--demonstrating that it is possible to design a mesh network [that] simultaneously approaches the latency, throughput and power limits...For an N-node mesh, the max hop count is sqrt(N) + sqrt(N). For example, our 16-node mesh is laid out 4 by 4, the maximum hop count is 8, and the average hop count is 4. So latency is much lower, with the disparity increasing as you scale up the core counts. Bandwidth is also much higher because there are many possible paths to spread traffic across."
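Taking the hop-count formula in that quote at face value, here is a small sketch of how the sqrt(N) + sqrt(N) bound scales with core count (illustrative only; the quote treats it as a bound on the Manhattan distance and takes the average as half the maximum):

```python
import math

def mesh_hop_bounds(n: int) -> tuple[int, float]:
    """Max and average hop counts for an n-node square mesh, using the
    sqrt(N) + sqrt(N) model from the quote (assumes n is a perfect square)."""
    side = math.isqrt(n)
    max_hops = side + side           # 8 for the 4x4 example in the quote
    return max_hops, max_hops / 2    # average taken as half the max

for n in (16, 64, 256):
    m, a = mesh_hop_bounds(n)
    print(f"{n}-node mesh: max {m} hops, average {a} hops")
```

For the 16-node case this reproduces the quoted figures (max 8, average 4), and the sublinear growth with N is the scaling advantage Peh is pointing at.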
Mesh networks have been shown to have latency issues of their own above 64 cores in the MP domain, and I think that is why Intel might have gone back to rings in the Xeon.
As core counts scale, the topology will keep changing from buses, to rings, to meshes, to exotic networks!
One thing I forgot to mention is that "virtual bypassing" has been proposed by Peh at MIT to optimize the Internet-on-a-chip for core-to-core packet communication, boosting speed by sending a probe signal through the on-chip routers to preset their switches before a data burst. That way the ordinary delays of sending packets through routers are minimized, unlike on the real Internet, where it doesn't matter if individual packets in a burst take different routes to the same destination.
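The benefit of presetting the switches can be sketched with a toy latency model. All the cycle counts below are hypothetical placeholders, not figures from Peh's router; the point is only that removing per-hop route computation and arbitration shrinks the head-flit latency while the serialization cost of the burst stays the same.

```python
# Hypothetical per-hop delays (assumed values for illustration only).
ROUTER_DELAY = 3   # cycles per hop with full route computation + arbitration
BYPASS_DELAY = 1   # cycles per hop once a probe has preset the crossbar

def burst_latency(hops: int, flits: int, bypassed: bool) -> int:
    """Head-flit latency plus serialization delay for a pipelined burst."""
    per_hop = BYPASS_DELAY if bypassed else ROUTER_DELAY
    return hops * per_hop + (flits - 1)

# A 4-flit burst crossing 8 routers, with and without bypassing:
print(burst_latency(8, 4, bypassed=False))  # 27 cycles
print(burst_latency(8, 4, bypassed=True))   # 11 cycles
```

Because the probe keeps the whole burst on one preset path, packets also arrive in order, which is the contrast with the Internet drawn above.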