Power8 also has an external L4 cache as part of its memory controller, which is likewise off-chip. Kind of a halfway house to HMC.
The power battle is actually going to be fought on the I/O side. Intel can dial up core complexity depending on usage; there is no reason a Haswell core cannot be brought down to 1W per core. Cache sizes and execution units are the main variables. I would not change the front end much, as that is more painful to vary between variants.
The interconnect is a major power hog. The number of rings can be altered, as Intel has shown. ARM interconnects tend to be NoCs, which actually consume more power than a ring. Interconnects like bus-enhanced NoCs or the swizzle switch from the University of Michigan do reduce NoC/crossbar power. In short, in terms of I/O architecture, ARM is actually the laggard on power. This will be evident if you do an apples-to-apples comparison. Intel needs higher I/O throughput, and hence you are seeing higher-power components. At those throughput levels, I suspect power dissipation is non-linear. We are doing some work on a 28nm multi-core design that shows some interesting results as you scale up.
In the new Atom server parts, you can see the ring not being used and a simpler crossbar-type interconnect used instead. That is probably simpler, though I still feel rings are more optimal in terms of power.
You can optimize an NoC, but you cannot make it more power-efficient than a ring.
You alluded to it in your post: rings are static and hence more efficient. The dynamic routing logic is the power hog.
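To make the ring-vs-NoC trade-off concrete, here is a toy model comparing average hop counts on a bidirectional ring and a 2D-mesh NoC under uniform random traffic. The per-hop energy constants are illustrative assumptions (not measured values); the point is that a mesh pays for route computation and a router crossbar at every hop, while a ring stop is a much simpler slot check.

```python
# Toy comparison of a bidirectional ring vs a 2D-mesh NoC for uniform
# random traffic. Energy-per-hop values below are assumed, for
# illustration only.

def ring_avg_hops(n):
    """Average hop count on a bidirectional ring with n stops."""
    # Each node reaches every other node via the shorter direction.
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)

def mesh_avg_hops(rows, cols):
    """Average Manhattan distance on a rows x cols mesh (XY routing)."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    total = count = 0
    for (r1, c1) in nodes:
        for (r2, c2) in nodes:
            if (r1, c1) != (r2, c2):
                total += abs(r1 - r2) + abs(c1 - c2)
                count += 1
    return total / count

E_RING_HOP = 1.0  # assumed relative energy per ring hop
E_MESH_HOP = 2.0  # assumed: dynamic route computation + crossbar per hop

ring_e = ring_avg_hops(16) * E_RING_HOP
mesh_e = mesh_avg_hops(4, 4) * E_MESH_HOP
print(f"16-stop ring: {ring_avg_hops(16):.2f} hops, rel. energy {ring_e:.2f}")
print(f"4x4 mesh    : {mesh_avg_hops(4, 4):.2f} hops, rel. energy {mesh_e:.2f}")
```

Under these assumed constants the mesh needs fewer hops per message but still spends more energy, which is the static-vs-dynamic-routing argument in miniature.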
The two NoC companies survive because of the mobile world. Mobile SoCs use a lot of IP, and NoCs let you change the topology and IP configuration easily. That is primarily why they are popular in mobile SoCs, not because of efficiency.
Had a chat yesterday with a couple of guys who used to do the OMAP and SPARC designs, just to make sure! I am currently designing a family of processors, using NoCs for the mobile variants, and plan to use crossbar/hybrid NoCs for the server-class config. Will know better after a few months of simulations. If the University of Michigan swizzle-switch config works, crossbars are the way to go for homogeneous designs. Verification is easy too.
I disagree that interconnect power is at a scale that compares with CPU power. CPUs are the highest power-consuming circuits on the die by a huge margin. The NoC-vs-ring discussion is again confusing things: NoCs are used to connect to peripheral interfaces and IP, and NoCs are not coherent by definition, whereas the rings Intel uses are the coherent CPU interconnect.
Reducing CPU power gives the biggest bang for the buck while still maintaining performance. Adding CPUs to the die increases power linearly, but the interconnect power increase is less than linear.
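The scaling claim above can be sketched with a toy model: core power grows linearly with core count while interconnect power grows sub-linearly. The per-core wattage, scaling constant, and exponent below are made-up illustrative parameters, not measurements.

```python
# Toy die-power model: linear core power vs assumed sub-linear
# interconnect power. All constants are illustrative, not data.

P_CORE = 2.0          # assumed watts per core
K_INTERCONNECT = 1.5  # assumed interconnect scaling constant (watts)
ALPHA = 0.7           # assumed sub-linear exponent for interconnect power

def die_power(cores):
    """Return (core power, interconnect power) for a given core count."""
    core_p = P_CORE * cores
    ic_p = K_INTERCONNECT * cores ** ALPHA
    return core_p, ic_p

for n in (4, 8, 16, 32):
    core_p, ic_p = die_power(n)
    share = ic_p / (core_p + ic_p)
    print(f"{n:2d} cores: cores {core_p:5.1f} W, "
          f"interconnect {ic_p:5.1f} W ({share:.0%} of total)")
```

With a sub-linear exponent, the interconnect's share of total power shrinks as cores are added; the opposing view in this thread (interconnect approaching 40% at large core counts) would correspond to an exponent near or above 1.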
1. I actually discussed the core and interconnect power separately. My position is that Intel and ARM cores are equivalent in efficiency at identical complexity. You can check out the Wisconsin paper, which shows that the x86 decode penalty is no longer significant. That said, interconnect power is significant at large core counts; it can approach 40% of the total according to a lot of research. Most CPU designers these days focus on that, including me.
2. You seem to be confused about higher-layer functions versus interconnect topologies. Functionally, buses, crossbars, rings, and NoCs are identical: all can be cache coherent. NoCs are NOT non-coherent by definition; you can run the AXI coherency protocol over an NoC. It all depends on your SoC config. For a large number of heterogeneous blocks, an NoC is the best choice, hence its prevalence in mobile. Intel wants power efficiency with mostly homogeneous blocks, hence the ring.
NoCs can very well be made coherent. Latency tends to be higher, so they are used for cache-coherent traffic only at high core counts. The bus-enhanced NoC architecture is a good compromise.
Did I read correctly that the ARM offering was going to be 18 watts (4x4.5)? Not exactly low power. Any analysis on building more chassis to house more processors that are less capable, from a reliability point of view?
Remember, this is 18W for a 3GHz octo-core at 40nm. A 22nm C2750 runs at 2.4GHz, uses 20W, and cannot achieve anywhere near the same performance (X-Gene should have better than Cortex-A57 performance, while we already know Silvermont is slower than Cortex-A15 clock for clock).
So it looks like Avoton will be beaten by a wide margin on both performance and power efficiency, despite having the advantage of two process generations. Now imagine a next-generation X-Gene at TSMC 20nm...
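The comparison in these two comments can be sketched as back-of-the-envelope arithmetic. The TDP and clock figures come from the thread; the relative IPC numbers are guesses (the claim is only that X-Gene's per-clock performance exceeds Silvermont's), so treat the output as an illustration of the argument, not a benchmark.

```python
# Rough perf/W sketch from the numbers quoted in the thread.
# "Relative IPC" values are speculative assumptions, with Silvermont
# normalized to 1.0; they are NOT measured results.

chips = {
    # name: (cores, GHz, TDP watts, assumed relative IPC)
    "X-Gene (40nm)":       (8, 3.0, 18.0, 1.1),
    "Avoton C2750 (22nm)": (8, 2.4, 20.0, 1.0),
}

for name, (cores, ghz, tdp, ipc) in chips.items():
    perf = cores * ghz * ipc  # arbitrary throughput units
    print(f"{name}: perf {perf:.1f}, perf/W {perf / tdp:.2f}")
```

Under these assumptions the 40nm part comes out ahead on both raw throughput and perf/W, which is the crux of the comment; a different IPC guess would shift the gap but not the shape of the calculation.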
Wilco1: Please show your evidence that the A7 performed better than Intel's part, either through the transistor physics or the architecture of the ALU layout in the chip. Please don't put forward alleged information that is unscientific and has no basis.
ARM is targeting microservers, which isn't the territory of the Ivytown-based Xeons; ARM is going after the Atom-based C2000 processors.
Ivytown Xeon vs. ARM microserver won't be much of a wrestling match: Semi-Truck vs. Fiat 500. One gets a lot of work done, and the other is very fuel efficient. Those meetings on the highway tend to leave one of them pretty squished. :)
@Some Guy: True and I tried to be clear on this in the story. Ivytown and X-Gene are not head-to-head competitors.
However, it's worth noting OEMs say many servers no longer need performance as much as lower power. For many jobs a high-end Xeon is no longer needed if it's just about pushing data through the Ethernet and storage interfaces. So in this way the two very differently focused products are wrestling over a pie of sometimes overlapping workloads.
@Rick - the point is that pie is already divided and the ARM is going after a piece that is already covered by Intel, which is already on its 2nd-generation Atom microserver chip before ARM is even out of the gate. Kind of like how the Japanese car makers were already on their 3rd-generation design of hybrid cars before Detroit ever got into the act.
At the same time, it will be interesting to see how it goes, especially when the advantages of ARM disappear as the instruction decode becomes a smaller share of the functional blocks in the total chip, and the power advantage is gone ... not to mention all the other server requirements, e.g., 64-bit, I/O, cache, ECC, etc. The most ironic outcome would be ARM creating a niche and it all ending up coming from Intel on its advanced process.
"the point is that pie is already divided and the ARM is going after a piece that is already covered by Intel, which is already on its 2nd-generation Atom microserver chip before ARM is even out of the gate."
That's an interesting rewrite of history... Calxeda has had its ARM servers out for well over a year now, and that was before Intel even announced Centerton, let alone shipped it! Note Calxeda has its 2nd generation out as well.
I also don't agree that the x86 penalty is low: if that were true, why is AMD having such a hard time keeping up with Intel while a dozen small outfits can design fast, efficient ARM cores that are challenging Intel? Even Intel took a very long time to come up with an Atom replacement, and it ended up being a simple 2-way core (as 3/4-way is too power-hungry on x86).