The OpenRISC 1000 architecture does, however, have several flaws. The biggest is perhaps code density, something that has already been pointed out. The other big ones that I can think of right now are the flag, the interrupt vector layout, and delay slots. Just fixing these would probably bring OpenRISC much closer to the feature list set by the RISC-V authors while allowing us to reuse most of the already available software.
I like the idea of a free/open standardized ISA. Unfortunately I don't have enough time right now to sit down and read through the RISC-V documentation, but overall it sounds like a solid ISA. My main criticism, however, is that it has been done more or less behind closed doors, which unfortunately is far too common in academia, and therefore seems to suffer a lot from NIH syndrome. I'm for example interested in how much time was spent to create and implement Chisel, and why none of the existing HDL generator languages were used. You say you started in 2010 and yet we only hear about this now.
As several others have already pointed out, the big task is not defining the ISA, but making it stick. Out of the hundreds of open source CPU designs available, only a few have gained any momentum. On the OpenRISC side we have several OS ports, both GCC and LLVM, several libc implementations, and boot loaders, and I'm inclined to believe that much more effort has been spent on the software support than on the hardware. A big risk with a new arch is that users lose interest before the required tools are implemented. We have noticed in the past that the lack of clear documentation and of an easy way to get started cost OpenRISC several potential contributors. lm32 is a good example of an arch that suffers from the lack of a large software collection rather than from any technical deficiencies.
I have a few complaints (as an outsider interested in computer architecture) about RISC-V and the RISC-V project. It looks like only recently has a mailing list been promoted on the riscv.org site (the archive for the Hardware Developers list indicates the first message was 7 Aug 2014) and as I recall (my memory is far from perfect) the contact information was not clearly presented.
(I had emailed Andrew Waterman quite some time ago about RISC-V Compressed, after reading his Master's thesis, and about other aspects of RISC-V. I did not receive a response. This is somewhat understandable, as it was a long email from a nobody, naturally leading to a longer delay to provide a decent answer, and delayed low-priority tasks are often forgotten. I myself have delayed responses so long that I eventually decided not to respond, so I cannot justly complain. However, if I had known about a mailing list or forum, I could have posted comments there.)
This is not a problem unique to RISC-V. Even though the OpenRISC project had a mailing list/forum, I found myself losing all interest in posting there as posts on architectural and microarchitectural ideas received little interest. I have some talent in computer architecture, and I would have liked to have contributed something of real value, beyond some edu-tainment from Internet posts. (Andy Glew, 13 Aug 2003: "You have the sort of obsessive attention to tradeoffs in computer architecture that is typical of some of us computer architects. ... But you have some talent. / If this was an Open Source project, I'd try to drag you in. / For a company, yeah, your resume scares me. / But I still see promise in your posts.")
(It is still not clear where Architectural thoughts should be posted. Microarchitectural thoughts should probably go to the Hardware Developers list, but Architectural thoughts might not be appropriate for Software Developers or Hardware Developers.)
Another complaint I have about RISC-V is that version 2.0 was finalized before the compressed format was established, and the placement of instruction fields does not account for 16-bit instructions. (I am also more inclined to marker bits than a contiguous field for size indication, at least for 16-/32-bit; I am guessing that such would be slightly more decode friendly for wide superscalar with 2 instruction sizes.) Since microcontroller implementations would benefit most from compression and have the tightest size and energy constraints, optimizing field placement for the compressed extension seems desirable.
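To make the size-indication point concrete, here is a minimal sketch of the two approaches. The first function follows the RISC-V length encoding as I understand it (low two bits not equal to 11 means a 16-bit instruction; low two bits equal to 11 with bits [4:2] not equal to 111 means 32-bit). The second, `marker_bit_length`, is purely hypothetical and only illustrates what a single marker bit might look like; it is not part of any specification.

```python
def instruction_length(low_halfword: int) -> int:
    """Length in bytes, decided from the first 16-bit parcel.

    Only the 16-bit and 32-bit cases are handled in this sketch.
    """
    if low_halfword & 0b11 != 0b11:
        return 2  # compressed (16-bit) instruction
    if (low_halfword >> 2) & 0b111 != 0b111:
        return 4  # standard 32-bit instruction
    raise ValueError("encodings longer than 32 bits are not handled here")


def marker_bit_length(low_halfword: int) -> int:
    """Hypothetical alternative: one marker bit picks 16 vs 32 bits."""
    return 4 if low_halfword & 0b1 else 2


if __name__ == "__main__":
    # 0x0013 is the low halfword of the 32-bit NOP (ADDI x0, x0, 0).
    print(instruction_length(0x0013))
    # 0x0001 has low bits 01, so it falls in the 16-bit space.
    print(instruction_length(0x0001))
```

With a single marker bit, each parcel's length depends on one bit rather than a two-bit comparison, which is the kind of small decode saving guessed at above for wide superscalar fetch.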
(The ABI will also have a significant impact on the encoding. The choice of R0 as a zero register also seems to work against simple RVC decode and works against 16-GPR variants if the ABI uses lower-numbered GPRs for arguments. [I think providing 16-GPR options could be useful for increasing thread count, vaguely similar to the flexible allocation of registers to threads in a GPU.])
I have a few other similarly minor complaints about the ISA (e.g., it looks like there is no canonical register clearing instruction—such can be used for renamer "zeroing elimination"—using ADDI would have the advantage of being already special, being used for NOP, and allowing potential small immediate setting instructions to use "register inlining" in the renamer). Many of my thoughts are too late given the finalization of the relevant portions of the ISA, but some would still probably have some value.
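As a rough illustration of the ADDI suggestion, the sketch below encodes ADDI using the standard I-type layout (imm[11:0], rs1, funct3, rd, opcode) and shows a check a renamer might apply to spot `ADDI rd, x0, 0`. The helper names `encode_addi` and `is_zeroing_idiom` are made up for this example, and treating that particular form as the canonical clear is the proposal above, not something the specification states.

```python
OPCODE_OP_IMM = 0b0010011  # major opcode shared by ADDI and friends
FUNCT3_ADDI = 0b000


def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Encode ADDI rd, rs1, imm as a 32-bit I-type instruction word."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register numbers must be in 0..31")
    if not (-2048 <= imm < 2048):
        raise ValueError("immediate must fit in 12 signed bits")
    return (((imm & 0xFFF) << 20) | (rs1 << 15) | (FUNCT3_ADDI << 12)
            | (rd << 7) | OPCODE_OP_IMM)


def is_zeroing_idiom(insn: int) -> bool:
    """Hypothetical renamer check: ADDI rd, x0, 0 with rd != x0."""
    opcode = insn & 0x7F
    rd = (insn >> 7) & 0x1F
    funct3 = (insn >> 12) & 0x7
    rs1 = (insn >> 15) & 0x1F
    imm = (insn >> 20) & 0xFFF
    return (opcode == OPCODE_OP_IMM and funct3 == FUNCT3_ADDI
            and rs1 == 0 and imm == 0 and rd != 0)


if __name__ == "__main__":
    print(hex(encode_addi(0, 0, 0)))  # NOP
    print(hex(encode_addi(5, 0, 0)))  # clear x5
    print(is_zeroing_idiom(encode_addi(5, 0, 0)))
```

The same decode path that already special-cases NOP would only need to relax the `rd` comparison to catch the clear, and a nonzero small immediate with `rs1 == 0` is the case that "register inlining" could pick up.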
I really appreciated the inclusion of rationales in the specification (even when I disagree). For some instructions it may also be useful to provide examples of use, and I think greater similarity to existing manuals (rather than the less formally structured current presentation) might be good. I think that having a wiki where such rationales could be expanded upon, code examples could be provided, and microarchitectural tricks could be described (with few length constraints) could be useful. Of course, a wiki could also serve as a teaching resource.
I do not intend to be excessively negative. I realize that even graduate student time is not an unlimited resource ☺ and that ISA design and project management choices must be made in a more complex context than an outside "computer architecture hobbyist" would be aware of (and that commitment to a specification must be made before everything is perfect, or even practically perfect in every way).
So, overall, we're pretty sure it's a reasonable comparison, though we're not completely sure about all the details in ARM's result to make sure we're being fair.
The appropriate initial reaction to your skepticism about our Dhrystone numbers would have been to ask us for more details so you would actually have some facts on which to base your opinion. Instead, in your very first post, you incorrectly assumed the worst possible behavior on our part, and incorrectly glorified what ARM had included in the core they measured. You might want to reconsider posting pejorative assertions that you have no way of knowing are correct. We're not above accepting apologies, if you're not above admitting your mistakes. ***
The branch prediction hardware is a BTB with 64 entries, a BHT with 128 entries, and a RAS with 2 entries. This amount of branch prediction helps Dhrystone, but would help a lot of other codes too.
As I said, we don't spend our lives worrying about Dhrystone, so when compiling the code we followed the guidelines in this document from our friends at ARM: DAI0273A_dhrystone_benchmarking.pdf.
Our standard C library does include hand-optimized assembly, and does make use of all 64 bits (of course!), but we also did this for functions not used by Dhrystone, as a standard library helps all code. We'll be posting our disassembled Dhrystone on the website shortly (bit big for a blog post).
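For readers unfamiliar with what a 128-entry BHT buys, here is a minimal sketch of one common organization: a table of 2-bit saturating counters indexed by low PC bits. This is an assumption for illustration only; the post does not say how the Rocket BHT is indexed or what counter scheme it uses, and the class and function names are made up.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by low PC bits (assumed design)."""

    def __init__(self, entries: int = 128) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Drop the two low bits, assuming 4-byte aligned instructions.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht: BranchHistoryTable, pc: int, outcomes) -> int:
    """Replay a sequence of taken/not-taken outcomes for one branch."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


if __name__ == "__main__":
    # A loop-closing branch: taken 9 times, then falls through; run 10 times.
    pattern = ([True] * 9 + [False]) * 10
    print(count_mispredictions(BranchHistoryTable(), 0x1000, pattern))
```

On a loop-closing branch like the one in the usage example, this scheme settles into roughly one misprediction per loop exit, which is why even a small table pays off on loop-heavy code and not just on Dhrystone.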
Same as ARM, we didn't actually fabricate this version but we have fabricated and measured enough variants in different processes to be confident in our layout results.
[ I'm having trouble posting the whole response in one go, so will try posting as a sequence of messages. Please read the sequence before responding and reply to last one in sequence. I'll mark it as the last one, because I don't know a priori how many I'll need. ]
We pulled the Dhrystone comparison together quickly, as we kept getting asked about how we compared to ARM cores and these were the only publicly available numbers we could easily compare against. We didn't spend a lot of time on it, as we're not particularly interested in "Dhrystone Drag Racing" with minimal stripped-down cores. Basically, we sized the caches to match ARM's configuration and just removed the vector floating-point unit we usually add, to be a fairer comparison with the ARM, which also doesn't have an FPU or vector unit (you are incorrect, these are optional in ARM A5). We didn't strip out a lot of other stuff that we could have. Specifically:
The Rocket core implements RV64IMA, i.e., base integer, integer multiply/divide, and atomic operations (which are quite extensive in RISC-V and go unused in Dhrystone). Our registers are twice as wide (64 vs 32) and we have twice as many user registers (32 versus 16) as ARM. The 64-bit width does help Dhrystone, but also lots of other code, and they are obviously included in our area number. The instruction cache was 16KB 2-way set-associative, 64-byte lines, and blocking on misses.
Data cache was also 16KB 2-way set-associative, 64-byte lines but because it was designed to work with our high-performance vector unit, it's non-blocking with 2 MSHRs, 16 replay-queue entries, and 17 store-data queue entries for D$. Obviously, none of these help Dhrystone, which never misses in the caches.
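For anyone checking the cache numbers, the geometry follows directly from the stated parameters: 16KB, 2-way, 64-byte lines gives 16384 / (2 × 64) = 128 sets, so 6 offset bits and 7 index bits. A small sketch of that arithmetic follows; the function names are made up for illustration, and tag width is left out because the post does not give the physical address width.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    if sets * ways * line_bytes != size_bytes:
        raise ValueError("size must be a multiple of ways * line_bytes")
    if line_bytes & (line_bytes - 1) or sets & (sets - 1):
        raise ValueError("line size and set count must be powers of two")
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    sets, offset_bits, index_bits = cache_geometry(16 * 1024, 2, 64)
    print(sets, offset_bits, index_bits)
    print(split_address(0x12345, offset_bits, index_bits))
```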
When we compared numbers with and without caches, we weren't sure what ARM left out, so we only removed the SRAM tag and data arrays and left in all of the above cache control logic in our core area. The MMU has 8 ITLB and 8 DTLB entries, fully associative, and the MMU has a hardware page-table walker. Obviously, the hardware page table walker doesn't help Dhrystone.
There is possibly a small set of pipeline designs for which delayed branches might make some sense as a simple way to reduce some control hazards. But there is a far larger universe of pipeline designs where they only hinder performance. This is not a controversial viewpoint amongst architects. Given that even a small investment in branch prediction hardware has high rewards, I'm not even sure there are any processor implementation budgets for which delayed branches in the ISA make sense on general-purpose code.