I am still not convinced by NoC, even though it has been a hot research topic for almost 10 years. It might be useful in high-performance computing. For the embedded domain, I do not see the point.
The landscape is a bit more complex than this picture of bus-rings-mesh. Freescale ships SoCs with 8 cores based on a switched fabric interconnect. Cavium ships products with 32 cores based on a combination of bus and switched fabric interconnect as well. Netlogic uses a hierarchy of rings to deal with the different I/O requirements of its cores/memories/accelerators.
A mesh adds a further layer of complexity to the already hard problem of fully and efficiently exploiting the compute capabilities of multicore SoCs: significant latency differences depending on where your task is scheduled to run.
Tilera has yet to gain significant traction in the market, and I would bet that you will see a lot of products with more than 16 cores shipping with something other than a mesh as the interconnect.
P.S. Cavium's Octeon 3 will ship with 48 cores, and without a mesh.
For applications that require off-chip access, such as to DRAM, bandwidth requirements will vary by distance from the interface. Applications with heterogeneous processing functions lack symmetry and therefore require a NoC with an irregular topology, just like the Internet.
Peh claims that the mesh implementation does scale up well. Here is what Peh told me: "Our paper is challenging this conventional wisdom--demonstrating that it is possible to design a mesh network [that] simultaneously approaches the latency, throughput and power limits...For a N-node mesh, the max hop count is sqrt(N) + sqrt(N). For example, our 16-node mesh is laid out 4 by 4, and the maximum hop count is 8, and the average hop count is 4. So latency is much lower, with the disparity increasing as you scale up the core counts. Bandwidth is also much much higher because there are many possible paths to spread traffic across."
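To put rough numbers on the hop-count argument in that quote, here is a minimal Python sketch that compares a square mesh with a bidirectional ring of the same node count. It is only an illustration under stated assumptions, not anything taken from Peh's paper: it assumes minimal (Manhattan) routing on the mesh, shortest-direction routing on the ring, and it counts link traversals between distinct nodes. Because of that counting convention its figures come out somewhat below the rounded sqrt(N) + sqrt(N) bound in the quote (for example, 6 rather than 8 for the 4 by 4 case). The function names mesh_hops and ring_hops are made up for this example.

```python
from itertools import product


def mesh_hops(k):
    """Return (max, average) link hops between distinct nodes of a k x k mesh.

    Assumes minimal routing, so the hop count is the Manhattan distance.
    Requires k >= 2.
    """
    nodes = list(product(range(k), repeat=2))
    dists = [
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in product(nodes, repeat=2)
        if (ax, ay) != (bx, by)
    ]
    return max(dists), sum(dists) / len(dists)


def ring_hops(n):
    """Return (max, average) link hops between distinct nodes of an n-node ring.

    Assumes a bidirectional ring where traffic takes the shorter direction.
    Requires n >= 2.
    """
    dists = [
        min((a - b) % n, (b - a) % n)
        for a in range(n)
        for b in range(n)
        if a != b
    ]
    return max(dists), sum(dists) / len(dists)


if __name__ == "__main__":
    # Compare the two topologies at 16, 64 and 256 nodes.
    for k in (4, 8, 16):
        n = k * k
        m_max, m_avg = mesh_hops(k)
        r_max, r_avg = ring_hops(n)
        print(f"N={n:3d}  mesh max/avg = {m_max}/{m_avg:.2f}  "
              f"ring max/avg = {r_max}/{r_avg:.2f}")
```

Under these assumptions the mesh's worst case grows with sqrt(N) while the ring's grows with N/2, which is the widening disparity the quote refers to. The sketch says nothing about per-hop router delay or contention, which is where the objections elsewhere in this thread come in.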
Mesh networks have been shown to have their own latency issues for core counts above 64 in the MP domain. I think that's the reason Intel might have gone back to rings in Xeon.
As core counts scale, the topology will keep changing from buses, to rings, to meshes, to exotic networks!
One thing I forgot to mention is that "virtual bypassing" is proposed by Peh at MIT to optimize the Internet-on-a-chip for core-to-core packet communication, boosting speed by sending a probe signal through the on-chip routers to preset the switches before a data burst. That way the ordinary delays for sending packets through routers are minimized, unlike on the real Internet, where it doesn't matter if individual packets in a burst go by different routes to the same destination.