The debate on ISAs is interesting. I have designed x86 CPUs as well as other ISAs. It is a fact that x86 is inherently more complex than MIPS, ARM or PowerPC, to varying degrees. There is certainly the CISC instruction-decode penalty, but there are other complex mechanisms that have been built into x86 over generations and still need to be supported by the latest x86 processors. All of these mechanisms take die size and/or complexity. Almost every x86 CPU implementation has a built-in microcode engine, essentially a programmable engine within the CPU to handle these complex tasks. Intel has also continued to stress floating-point performance, and each generation adds additional instructions, adding transistors to the design.
So why is this relevant? This "overhead" becomes smaller in very high-performance implementations - out-of-order, multi-threaded, large-cache designs - where it can be amortized over the performance gains of a complex CPU. This is why Intel has competed well at the very high end of compute but failed in the low-power, efficient designs that are required for mobile.
In less complex implementations, where the CPU has fewer transistors, this overhead starts to make a difference. This is why the mobile processors from Intel, and even the Atom cores, have not competed so well.
Well it's obvious you've never looked in detail at the complexity of the x86 ISA. The overheads of x86 affect the whole microarchitecture. With an identical microarchitecture x86 would end up slower (and thus less power efficient). For x86 to achieve the same performance as a RISC, it needs a far more complex microarchitecture, increasing die size and power. You can compare die sizes for various ARM and x86 CPUs here: http://chip-architect.com/news/2013_core_sizes_768.jpg
The claim that x86 has a dense encoding is yet another myth. In fact the complex encoding means that x86 binaries are typically a little larger than ARM binaries, and significantly larger than Thumb-2. x64 is usually 15% larger than x86.
Yes I've read that paper and discussed it in detail on RWT. It is a badly written paper with most of the conclusions not supported by evidence. If you choose to compare wildly different and relatively ancient CPUs, an old compiler and completely ignore the memory system then of course the only possible conclusion is that microarchitecture matters the most! But that's only true if you make wild extrapolations and ignore or handwave at all other aspects. Let's hope this paper was a one-off mistake and doesn't reflect on the quality of papers coming from this university.
Note PPC is certainly not CISC. Neither is ARM or Thumb. PPC vs ARM is less interesting as their ISA features are nearly identical (not that there aren't differences, but they tend to be insignificant details).
Actually, reality also intrudes when it comes to these power figures. When we did some power-analysis testing on a 3.4 GHz Core i7 (Ivy Bridge), hitting max load on all 4 cores brought thermal management into the act and reduced the frequency to below 2 GHz. The key is that it seems to stay there for a long time!
I think the team is publishing these results
Granted, this is not an ISA issue, but real-world performance is always different from the claims. So in this case you may as well go with a 2-2.2 GHz processor if you want to have all cores running at max frequency. IB-EP is now at 15 cores; presumably a Broadwell-based design will hit 24+ cores. At these core counts we are probably looking at 2-2.5 GHz speeds, which implies ARM and x86 cores in the server market will run at similar clock speeds. Makes comparisons easier, I guess.
Our server designs are also aimed at 2-2.5 GHz with large core counts. It does not seem to make sense to go above this, and besides, our architecture capability is excellent but our physical-design capability is not that great. But then we are an academic/research entity!
1. The power overhead is only in the decoder and related support functions. The others are a washout: the denser encoding helps, and the smaller number of registers does make some muxes simpler. I am not theorizing here. I run a large processor design group designing server- and HPC-grade processors, so these are issues we analyse in great detail. My colleague, a professor in fact, has an x86-compatible design under his belt. But don't take my word for it - the HPCA 2013 paper goes into it in more detail.
Bottom line: the differences are negligible and the micro-architecture is what matters. There really is NOT a great x86 ISA penalty.
But having said that, I am not advocating using an x86-like ISA. That path is a one-way ticket to an asylum for any CPU designer! Intel, using an incredible amount of resources, has managed to more or less eliminate the burden of the x86 ISA. CISC vs RISC is an entirely different issue. PowerPC is a better exemplar of CISC done right, and a PPC vs ARM comparison (or RISC-V, preferably) is a better technical debate.
2. If Cavium wants to address the server market, they had better get single-threaded perf. right. Oracle had to create the M-class CPUs to compensate for the T class's bad single-threaded performance. Again, I am commenting on what the market wants. Single-threaded perf. is needed because most of the world's programmers cannot write multi-threaded code even if their lives depended on it.
On a purely technical basis the SPARC T series approach is the best way to go (Intel HPC parts are also taking this approach). Back in my Sybase days, there would be 1k+ threads per core and we could not get enough cores or HW thread support. I presume it is still the same. But other RDBMSs do not leverage threading so well.
FS's approach in the T4240 shows that a balance can be reached with wide OO and lower-power large core counts. Our testing will reveal how far this is true. They have also taken an AMD-like approach where threads are almost independent cores with dedicated execution and related resources.
1. It's a myth that ISA overhead is just in decode. There are many aspects of an ISA that affect the overall microarchitecture. Just to mention one example, x86 requires more load/store units due to having fewer registers and load+op instructions. x86 also uses a more complex memory ordering model.
2. Given they designed their own CPU it seems likely Cavium are aiming for better than Cortex-A57 performance, as otherwise they could have just licensed that (the same argument applies to X-Gene). A 3-way in-order is not completely implausible, but to get decent throughput it would need to be at least 2-way and ideally 4-way multithreaded.
4. If all else is equal, an identically performing x86 would use more power than ARM due to its more complex ISA. So the x86 ISA really is LESS efficient. Of course different processes, microarchitectures etc can mitigate this difference.
In any case there is no doubt a dedicated CPU can outperform a generic Xeon despite having a process disadvantage (as you say in point 5). Beating Xeon on single-threaded performance is much harder of course, but that is not something Cavium or X-Gene are attempting (at least with their current line-up). For many tasks, using more, slower cores is actually far more energy efficient.
@Servernut: Applied definitely got out there early. I have written 3-4 stories about them so far. I am not at Computex, so would love to hear the latest. For a while they have been in Cavium's spot: we have been waiting for them to ship and report performance specs. Anyone have an update on that?
I think this has come up a few times in this forum, but I guess it is worth reiterating some of the conclusions we reached over the past 2 years.
1. This is a key point. Intel's main challenge is its high-margin model, especially in the server segment. It is unrealistic to believe that any competing vendor can beat it on the semiconductor process or in systems architecture. Its x86 instruction set does cause an overhead, but that is primarily in the decode stage.
2. W.r.t. OO architecture, Intel, POWER8 and the NetLogic 4-issue pipelines are some of the best when it comes to ILP. The FS QorIQ is equally good, but it is dual-issue. Cavium typically went for in-order cores, and it is not clear how aggressively out-of-order these cores are. There is nothing inherent in the ARM ISA that will prevent Cavium from matching Intel in OO performance; it is just a question of what they wanted out of this design.
3. In server-grade parts, the interconnect and the cache architecture are key, and I have a tough time believing Cavium can match Intel, AMD or IBM in this regard within one generation. There is nothing magical in getting to where Intel is today, but it needs work over 3-4 generations of parts and lots and lots of real-life usage data. Intel's ring bus and QPI are pretty optimized.
4. ARM does NOT have a power advantage over Intel. If the SoC config is the same (process, cache size, OO width, etc.), you will find that parts built using either ISA have TDP numbers in the same ballpark.
5. We do have perf numbers from one Freescale part. The 12-core T4240 matches a 6-core Ivy Bridge Core i7 part with half the TDP and half the frequency. This is just the result from one benchmark, CoreMark, so take it for what it is worth. And it is not an apples-to-apples comparison, since the cache sizes are vastly different, the pipeline depth is half, and the I/O mix is different. FS chose a shallow pipeline since, for networking/comms applications, frequent pipeline flushes are less expensive with shorter pipelines.
But we are doing a detailed study using enterprise benchmarks in our lab, so stay tuned! We may find that FS's design choices make sense in an enterprise environment too.