Chisel helps precisely because it's easier to change the core around, and to build parameterized cores in the first place. It's difficult to predict how things will turn out once pushed all the way to layout, so the design iterations involve a lot of rewriting of RTL or searching over design parameters.
V8 was the open IEEE standard. Although Sun, to their credit, released a few open-source designs of 64-bit SPARC V9 cores, I don't believe the ISA was officially freed, and now Oracle owns the architecture.
"If you had read our response, you would see that the A5 compared doesn't have FPU or NEON either."
That's not the issue at all. First of all, what exactly is Rocket? I assume it just implements the base 64-bit RISC-V ISA and uses 1-way 8KB L1 caches. If we agree on that, it means that:
1. Any comparison with Cortex-A5 is fundamentally flawed (irrespective of the FPU)
2. It won't run anything other than Dhrystone well, if at all (even CoreMark does multiplication for example, many integer codes in SPEC use floating point, and any code that uses serious memory bandwidth will be slow)
Yes it would be very educational to get some SPEC scores. It'll show that designing an all-round CPU is hard. It took ARM quite a few generations to get decent performance.
1. Yes my experience is different so I question the results. Without more details it's hard to believe the numbers are right. Of course this guy had all the reasons to show RISC-V in the best possible light...
2. I'm well aware of the Cortex-A5 features, which is why I keep reminding you that this core is far more advanced than you believe. Again, it has multiplies, DSP instructions, SIMD instructions, performance counters, debug, virtualization, interrupt controllers, support for ARM and compressed Thumb-2, support for multiple cores, load/store exclusive etc etc etc in the base configuration. This is not something you can remove as all of this is part of ARMv7-A. None of the features I mentioned above are in Rocket. So no, these cores are nothing alike, and claiming they are is simply being dishonest.
As for performance, so far we haven't seen any evidence that Rocket is faster on real benchmarks. Even the Dhrystone score is questionable without more details (there are a few tricks you can pull to get a good score in 64-bits). But... are you really suggesting that Rocket will win on SPEC with its 8KB 1-way set associative caches??? I'd really love to be in the meeting where you try to pitch that to a potential customer :-)
3. I have personally benchmarked CoreMark extensively on various CPUs and no it does not stress the caches at all. It does stress the branch predictor, especially the indirect predictor, but that's pretty much it. And there are a few compiler tricks that defeat the benchmark and make it run 40-50% faster. Some may well recommend it (presumably after they broke the benchmark), but good luck comparing CPU performance based on that!
4. Surely the ISAs matter, that's the whole point of this conversation. For example this is how you do an array access on ARM:
LDR R0, [R1, R2, LSL #2]
And this is what you do on RISC-V:
slli x3, x2, 2
add x3, x3, x1
lw x3, 0(x3)
This is an important difference due to the load instructions in the ISAs. In the second case we not only have more instructions which require serialized execution, we may also get an ALU->AGU stall depending on the details of the pipeline.
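To make the instruction-count argument concrete, here is a minimal Python sketch. It is a toy model, not an ISA simulator: the function names are hypothetical, registers are plain integers, and elements are assumed to be 4 bytes wide, as implied by the shift by 2.

```python
# Toy model (an illustration only, not a real ISA simulator): computes the
# address of a[i] for 4-byte elements both ways and counts the instructions
# each sequence needs. All names here are hypothetical.

def arm_scaled_load_address(base, index):
    """Models LDR R0, [R1, R2, LSL #2]: one instruction forms base + (index << 2)."""
    address = base + (index << 2)
    return address, 1  # (effective address, instructions issued)


def riscv_load_address(base, index):
    """Models a shift, an add and a load with zero offset: three dependent steps."""
    scaled = index << 2       # shift the index left by 2
    address = scaled + base   # add the base register
    address = address + 0     # the load itself applies an immediate offset of 0
    return address, 3         # (effective address, instructions issued)


if __name__ == "__main__":
    base, index = 0x1000, 5
    print(arm_scaled_load_address(base, index))
    print(riscv_load_address(base, index))
```

Both functions return the same effective address; only the instruction count differs. The count applies to one isolated indexed access and says nothing about what a compiler does with a whole loop.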
Yes, benchmark results would be great, but we have not seen any definitive benchmark results on similar microarchitectures.
"As to what sells in the market, I think I have a pretty good idea ! Please see our website"
I like the enthusiasm, but it would be nice to see some actual SoCs that are competitive with high-end ARM ones. Then people might take it seriously. I think you have no idea how much work goes into designing a real commercial CPU. Intel has been pumping many billions of dollars into their Atom line over the last 6 years, trying to make it competitive with ARM cores - and despite their huge advantage in process technology it has been without any success.
"Sure it may run Linux, but the Rocket variant that was compared with the A5 doesn't appear to have an FPU or any other extension beyond the very basic 64-bit ISA. So that variant most certainly can't run SPEC well or do anything that you expect in a modern CPU (multi-core, debug, performance counters, timers, interrupt controllers and so on). Cortex-A5, while one of the smallest ARMv7-A cores, is far more advanced, significantly faster and has good code density (unlike RISC-V)."
If you had read our response, you would see that the A5 compared doesn't have FPU or NEON either.
Lacking your ability to project SPEC scores from Dhrystone results, we are hard at work porting SPEC codes to both ARM and RISC-V to obtain fairer comparisons. But mostly we're porting SPEC in the interests of improving our cores' performance rather than marketing. ARM and RISC-V have very different business models, but it's very educational to compare performance metrics.
"I hope RISC-V is successful, but I would rather it focussed on forward-looking innovation in ISA development, instead of industry standardization, which is inevitably backwards looking."
Thanks for reading the spec. One of the primary reasons we designed RISC-V was to support architecture research (the other was to support education). Only later did we realize what we'd built might make a good industry-wide standard. We believe the way we provide sufficient opcode space and easy-to-parse variable-length instructions, along with our conventions on ISA extensions, will support a very rich set of new instructions, while never burying the very solid core.
Andreas, I do appreciate the benefit of collaboration, but having first-hand experience with the open source community I know that it is no panacea. It works at a much slower pace than a commercial closed environment and involves a lot of compromises to get anywhere near where you'd like to get.
You're quite right, like with any project (open or not), the most difficult part is to build critical mass. I don't have an opinion on whether that will happen for RISC-V - time will tell whether open ISAs/cores are a fashion fad or here to stay, but given the fortunes of its predecessor MIPS, I suspect it is not going to be easy.
I think you are underestimating the value of collaboration. As smart and talented as my friends at ARM are, they are no match for the combined intelligence and engineering experience of the whole industry.
Creating a good ISA and base set of tools is non-trivial, but the really hard part is making it stick. There were already tons of open source processors (>100?) on opencores.
In my opinion, RISC-V is the open source processor to date with the best chance to make a broad impact long term.
Getting an open source project to critical mass and keeping it from crashing is a very difficult task, but when it's done right it benefits the whole world (see Linux and the Eclipse foundation as good examples).
Andreas, if you have the right ISA experience then yes, one can gather a lot from just a spec. And the spec basically shows a cut-down MIPS with a few Alpha features. Neither is well known for their great code density or IPC due to the simple instructions. So nothing revolutionary - just a simple evolution of the original MIPS with various mistakes removed (and IMHO a few new mistakes added). It's interesting but try to compare it against ARM64 for example (next year we'll likely see a big shift towards 64-bit in the mobile space).
I don't think this is the first free/open ISA and I don't understand the advantage of being so. For software it makes sense as there are literally millions of developers who benefit. For ISAs/CPUs there are maybe a hundred or so companies with the resources to make a competitive SoC, and most already license existing "closed" ISAs and CPUs.
1. I've done similar code density comparisons before and MIPS always came out significantly larger than ARM (in fact even MIPS-16 was usually a little larger than ARM!). Also I typically see x86 and ARM being similar in size. So these results seem a bit odd, and I wonder whether they have been done correctly.
2. I disagree. One core has all the modern features you'd expect in a CPU including virtualization, DSP, SIMD, multipliers, multi-core support, the other has none of those, so that clearly skews the comparison. We also don't exactly know the size of the branch predictor, TLB entries etc, so these are not comparable either. Given that all this affects performance, area as well as power it is important to do a like with like comparison.
For Linux, Android and SPEC you actually need large advanced branch predictors, a good memory pipeline, a large and highly associative L1, big TLBs, big low latency L2, prefetchers, fast memory controller etc. You may have a great integer pipeline but if you don't have all of those features your performance is going to suck (unless you only ever run Dhrystone or CoreMark).
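The associativity point can be illustrated with a small cache model. This is a minimal sketch under stated assumptions: the 64-byte line size, the LRU replacement policy and the access pattern are all chosen for illustration and are not taken from Rocket or any ARM core. Only the 8KB capacity comes from the discussion above.

```python
# Minimal set-associative cache model with LRU replacement.
# Assumptions for illustration: 64-byte lines, LRU policy, synthetic access
# pattern. None of these describe a specific real core.

def count_misses(addresses, cache_bytes=8 * 1024, line_bytes=64, ways=1):
    """Count misses for the given byte addresses (ways=1 means direct-mapped)."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [[] for _ in range(num_sets)]  # per set: tags, most recently used last
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)   # hit: re-appended below to refresh LRU order
        else:
            misses += 1
            if len(tags) == ways:
                tags.pop(0)    # set is full: evict the least recently used tag
        tags.append(tag)
    return misses


if __name__ == "__main__":
    # Two addresses exactly one cache size (8KB) apart, accessed alternately.
    pattern = [0x0000, 0x2000] * 100
    print("1-way misses:", count_misses(pattern, ways=1))
    print("2-way misses:", count_misses(pattern, ways=2))
```

With this particular pattern the two addresses map to the same set, so the 1-way cache evicts on every access while the 2-way cache of the same capacity holds both lines. Real workloads are far less extreme; the sketch only shows the conflict-miss mechanism.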
3. CoreMark is a bad benchmark, I class it as worse than Dhrystone. At best you can call it an indirect branch predictor torture test. I recently heard that even the Pro version of CoreMark is horribly broken as it appears to spend >75% of its time in strcat...
When we are talking about a single issue in-order pipeline executing mostly single-cycle instructions then the only thing that matters for performance is reducing the number of instructions executed. So clearly any 2+ instruction sequence that can be combined into single-cycle instructions will improve performance.
On the other hand, if you consider a 3/4 way out of order CPU then I'd agree that a simpler ISA works fine there as any extra instructions can often be executed by the OoO machinery without penalty (you still have to worry about latency, eg. not having LDR R0, [R1, R2, LSL #2] costs you a few extra cycles latency to calculate the address plus AGU stalls).
I do like minimal ISAs, but this has gone too far. On cheap MCUs you have fast multipliers and dividers, DSP extensions, and nowadays even floating point! Today it is no longer about minimizing the transistor count of just the CPU, it's the whole SoC you need to consider. Not having certain instructions means you need extra library code (eg. division or floating point emulation), and that also has an area, power and performance cost.
Anyway don't take my word for it. Just try selling a SoC without division or multiply to people used to single-cycle multiplies and 10 cycle divisions. Good luck!
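As an illustration of the "extra library code" a core without a divide instruction ends up carrying, here is a minimal sketch of restoring shift-subtract division for 32-bit unsigned values. The name `udiv32` is hypothetical, and this is the generic textbook algorithm, not the routine from any particular compiler runtime.

```python
# Restoring (shift-subtract) division, written the way a software fallback
# would do it: one loop iteration per result bit. Hypothetical helper for
# illustration; assumes 0 <= dividend < 2**32 and 0 < divisor < 2**32.

def udiv32(dividend, divisor):
    """Return (quotient, remainder) for 32-bit unsigned operands."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
    return quotient, remainder


if __name__ == "__main__":
    print(udiv32(100, 7))
```

The loop always runs 32 iterations, each a handful of shift, compare and subtract operations, and the routine itself has to live in code memory. That is the kind of area and performance cost being referred to.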