The debate on ISAs is interesting. I have designed x86 CPUs and other ISAs as well. It is a fact that x86 is inherently more complex than MIPS or ARM or PowerPC to varying degrees. There is certainly the CISC instruction decode penalty but there are other complex mechanisms that have been built into x86 over generations which still need to be supported by the latest x86 processors. All of these mechanisms take die-size and/or complexity. Almost every implementation of x86 CPU has a built in micro-code engine. This is like a programable engine within the CPU to handle these complex tasks. Intel has continued to stress floating point performance and each generation adds additional instructions adding transistors to the design.
So why is this relevant? This "overhead" becomes smaller in very high performance implementations - out-of-order, multi-threaded, large cache designs. Here the overhead can be amortized over the performance gains of a complex CPU. This is why Intel has competed well at the very high end compute but failed in low power efficient designs that are required for mobile.
In these less complex implementations where the CPU has fewer transistors, this overhead starts to make a difference. This is why the mobile processors from Intel and even the Atom cores have not competed so well.
Well it's obvious you've never looked in detail at the complexity of the x86 ISA. The overheads of x86 affect the whole microarchitecture. With an identical microarchitecture x86 would end up slower (and thus less power efficient). For x86 to achieve the same performance as a RISC, it needs a far more complex microarchitecture, increasing die size and power. You can compare die sizes for various ARM and x86 CPUs here: http://chip-architect.com/news/2013_core_sizes_768.jpg
The claim that x86 has a dense encoding is yet another myth. In fact the complex encoding means that x86 binaries are typically a little larger than ARM binaries, and significantly larger than Thumb-2. x64 is usually 15% larger than x86.
Yes I've read that paper and discussed it in detail on RWT. It is a badly written paper with most of the conclusions not supported by evidence. If you choose to compare wildly different and relatively ancient CPUs, an old compiler and completely ignore the memory system then of course the only possible conclusion is that microarchitecture matters the most! But that's only true if you make wild extrapolations and ignore or handwave at all other aspects. Let's hope this paper was a one-off mistake and doesn't reflect on the quality of papers coming from this university.
Note PPC is certainly not CISC. Neither is ARM or Thumb. PPC vs ARM is less interesting as their ISA features are nearly identical (not that there aren't differences but the differences tend to be insignificant details).
1. It's a myth that ISA overhead is just in decode. There are many aspects of an ISA that affect the overall microarchitecture. Just to mention one example, x86 requires more load/store units due to having fewer registers and load+op instructions. x86 also uses a more complex memory ordering model.
2. Given they designed their own CPU it seems likely Cavium are aiming for better than Cortex-A57 performance, as otherwise they could have just licensed that (the same argument applies to X-Gene). A 3-way in-order is not completely implausible, but to get decent throughput it would need to be at least 2-way and ideally 4-way multithreaded.
4. If all else is equal, an identically performing x86 would use more power than ARM due to its more complex ISA. So the x86 ISA really is LESS efficient. Of course different processes, microarchitectures etc can mitigate this difference.
In any case there is no doubt a dedicated CPU can outperform a generic Xeon despite having a process disadvantage (as you say in point 5). Beating Xeon on single-threaded performance is much harder of course, but that is not something Cavium or X-Gene are attempting (at least with their current line-up). For many tasks, using more, slower cores is actually far more energy efficient.
@Servernut: Applied definitely got out there early. I have written 3-4 stories about them so far. I am not a Computex so would love to hear the latest. For a while they have been in Cavium's spot: we have been waiting for them to ship and report performance specs. Anyone have an update on that?
@Rick. Cavium's Octeon designs are not out-of-order but in-order. Thunder is likely to be the same. All other server CPUs - Xeons, Opterons, even X-Gene are fully out-of-order machines.
Actually XGene from appliedmicro is completely missing from your post. They showed a mini-datacenter running at Computex this week. Any reason, you are not covering them? They seem to be shipping already. From the specs they seem to have everything that thunder is claiming and a few years ahead.
Now I'd like to hear some reality about where we are at and need to be at in server software for ARM if this borader initiative of which Cavium is just one part is going to get traction. Details, please!