We typically only get news about the major players in the CPU game, so it is quite refreshing to hear about a small group of people, named Out-of-the-Box Computing, hoping to make big changes in CPU architecture. "The Mill" is their name for the processor. In this interview conducted by Hackaday, Ivan Godard explains what the Mill is and how it is different.
In this video, Ivan covers the basics. He discusses the history of Out-of-the-Box Computing and the ideas and inspiration that formed the Mill CPU. Citing the stagnation in processor advancements after the RISC vs. CISC wars, he says that his group just knew they could "do better." They identified a huge gap between the price points of the embedded world and the desktop-computer world and thought that if they could just fill that massive gap, they could have something really big.
When asked if their intent was simply to produce a core, or to go all the way to producing silicon themselves, Ivan responded with this:
"Intel's quarterly dividend is bigger than ARM's annual sales. Consequently yes, we would like to be a chip company. The fallback option, of course, is that we can be an IP house."
Another quote that really stood out is:
"We really are a great supercomputer chip. Nobody makes any money at it, but they'll do anything to get more -- and we're more."
In Part 2, we get to learn a bit about the internals of the Mill CPU. Ivan compares it to a standard DSP, but points out that the advantage is that the Mill uses roughly 10% of the power of other chips for the same computations. It does this by completely rethinking how instruction sets are handled, a topic he covers in depth in this video. He points out that, even though many people may not require the higher computational power, or even the lower power usage of the Mill, the more efficient use of space will allow for higher yields in manufacturing.
While discussing the issue of expanded memory bandwidth, Ivan pulled out this gem, which is particularly amusing:
Some years back, we took a proposal to Lawrence Livermore and they said, "Can this be built?" We had to go to a partner (it was LSI Logic at the time) and say, "Can you build us a chip with 2,700 pins?" And they swallowed real hard and said yes.
He went on to note, "The yield will be horrible, it will be incredibly expensive, and some people will want it." He actually discussed the memory interface in depth in this video, showing that, in fact, they need less memory access, not more. The rough estimate is 25% less memory access than others.
On the topic of the difficulties they are facing, he explains that money is a huge hurdle. They have to replace their tool chain, port an operating system, and even file over 50 patents.
To create parallel multicore systems, many FPGA tools fall short because they provide design assembly and implementation infrastructure but are lacking in analysis. At Space Codesign, one of the ways that our SpaceStudio ESL hardware/software codesign tool can be used is as a design creation front end for FPGA tool infrastructures like Xilinx Vivado (and likely others). We published a position paper on this topic on this site a few weeks ago ...
The key to supercomputer performance is that your architecture is optimized for an application, or family of applications. Knowing the internal details of a processor core or FPGA device helps (there are architecture diagrams available, after all!), but it is the system-level performance that comes into play at the end of the day.
Peter Kogge has an interesting article called Next-Generation Supercomputing (IEEE Spectrum, January 2011). In it he states that the bottleneck with next-generation supercomputing is not the speed of floating-point processors. The problem is that the power needed to transfer data to and from those processors is much higher than the power used by the processors themselves. So a conventional computer memory hierarchy with caches and main memory becomes impractical.
A possible solution? How about FPGAs as I mentioned above -- you arrange the FPGA logic implementing your problem so that each result is pumped to adjacent or at least nearby processing elements, not bothering with register files and caches. However, it's not practical to do this because of... FPGA tools, as I just described. JMO/YMMV
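To make the "pump each result to a neighbor" idea concrete, here is a minimal software sketch of a linear chain of processing elements, simulated one clock tick at a time. It is only an illustration under stated assumptions: the function name `run_pipeline`, the three example stages, and the single latch between neighbors are all hypothetical, and real FPGA data paths would be described in an HDL rather than Python.

```python
from typing import Callable, List, Optional


def run_pipeline(stages: List[Callable[[int], int]], inputs: List[int]) -> List[int]:
    """Simulate a chain of processing elements (PEs), one clock tick at a time.

    Assumes at least one stage. Each PE reads only the latch written by its
    left-hand neighbor; there is no shared register file or cache.
    """
    # latches[i] holds the output of stage i, or None if that slot is empty.
    latches: List[Optional[int]] = [None] * len(stages)
    feed = iter(inputs)
    results: List[int] = []

    # Enough ticks to feed every input and then drain the pipeline.
    for _ in range(len(inputs) + len(stages)):
        # The last latch is the output port: collect whatever is there.
        if latches[-1] is not None:
            results.append(latches[-1])

        # Update right to left so each PE sees its neighbor's value
        # from the previous tick, as clocked hardware would.
        for i in range(len(stages) - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = stages[i](prev) if prev is not None else None

        # The first PE takes the next input sample, if any remain.
        sample = next(feed, None)
        latches[0] = stages[0](sample) if sample is not None else None

    return results


# Hypothetical three-stage data path computing (2x + 3)^2.
stages = [lambda x: x * 2, lambda x: x + 3, lambda x: x * x]
print(run_pipeline(stages, [1, 2, 3]))  # [25, 49, 81]
```

Once the pipeline fills, one result comes out per tick, and every intermediate value moves only one hop to the adjacent element. That short hop is the property being argued for here, since the concern raised above is the cost of moving data rather than the cost of the arithmetic.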
An FPGA-based reconfigurable computing engine has the potential to be a superb high-performance supercomputer. Unfortunately, FPGA tools are not up to the task as discussed in this 2007 article. It has to be as easy to design parallel hardware data paths as it is to write code for general-purpose CPUs, and that's not the case with current FPGA design languages and tools. FPGA tool research has always been stymied by the fact that no major FPGA manufacturer publishes their internal architecture so that the research community can develop efficient design tools for reconfigurable computing. It would be like Intel refusing to publish the X86 instruction set and requiring everyone to program in PL/M using a compiler provided by Intel. I believe this is the primary reason CPU makers sell billions and FPGA makers have stayed small. JMO/YMMV
My comment should be seen in context! In this particular case we are talking only about CPU execution pipelines. Mill is a new implementation of an old idea, stack machines. It adds VLIW to the mix. That by itself is interesting. But in the combined state space of register and stack machines, the basic variants were outlined a while ago. Major refinements are still possible but I am sceptical about radical new ideas. The discussion in comp.arch that is currently underway is about von Neumann architectures stagnating. Quoting Mitch Alsup from the discussion (saw this after I posted my reply):
"The vonNeumann model is pretty well played out. The big problem is this model does one thing and afterwards starts to do the next thing (i.e appears completely serial right down to the exception model.) This bottleneck is what is preventing forward progress on any large scale.
Computer architecture is awaiting a parallel vonNeumann model and will languish with minor updates/upgrades until such a new paradigm come forth. This model has to support multiple memory references at the same time with essentially no ordering requirements, multiple arithmetic operations with essentially no ordering constraints, and multiple paths of control with essentially no ordering constraints; yet result in computations that make sense from the programming model. The "essentially" part is where the exploitable parallelism will come from."
So we are mostly stuck with incremental enhancements that typically come when reducing silicon geometries permit them. I teach comp. arch and run a large processor dev. group (which is developing a family of processors for the India Processor Project), and believe me, I too find this straitjacket irritating. If I go superscalar, I still use a Tomasulo variant, a design that came in the 60s! If you take a look at the Mill, it tries to deal with the issues related to the tyranny imposed by the register file. Innovative implementation but not a fundamental change. It is like IC engine design using the Carnot cycle.
New ideas are possible. Dataflow architectures do need to be revisited. For example, a lot of groups including ours feel exact computing is too restrictive as a universal model. So a combination of stochastic computing with, say, transactional memory would alter the execution pipeline more radically, since you will get far more ILP. But even that is only mildly radical in terms of how you compute results; the execution pipeline is still finally an entity that has to deliver results that have to converge to some order. But since the problems we are trying to address these days are media, search and large data set related, these do hold promise. After all, no one really is asking for SAP to run 3 times faster!
I guess the nature of the problems we are trying to address, think a typical accounting program, limits the design space. I have been pondering this since the mid 80s. No easy way out! Quantum computing and neural computing offer possibly the only option of radical change but, ever the sceptic, I wonder what effect they will have on non-search-related problems. The brain, after all, is terrible at doing accounting!
But there is a revolution underway in terms of formally verified designs and secure computing. These are not glamorous and hence do not make your pages! One example would be the DARPA CRASH project, specifically the crash-safe project (crash-safe.org). The tagged ISA arch is not new, Burroughs did it in the 60s, but new research in type systems allows you to use these tags in ways not envisaged before, specifically in modelling information flow and enforcing the flow using HW. To put it differently, perhaps we should focus less on innovating at the lowest level, the CPU arch., and focus more on higher levels of computing where the state of the art is frankly primitive. Humans think at high levels of abstraction and in metaphors and to a large degree declaratively. But all current programming languages and computing models take us out of our comfort zones by being low level and imperative.
Another possible area (which has not seen much traction after the MIT Transit project) is dynamically alterable ISAs. The idea being that using FPGAs you can essentially present each thread of execution with a CPU arch. suited to its behaviour. Currently only minor changes like no. of functional units and register set size have been attempted. But you could go radical and do both register based and stack architectures (Mill style VLIW or other variants). This also implies the compiler backend will vary depending on your program. The era of Just in Time Compiler Compilers is here. (You heard it here first).
There have been ongoing discussions on this at comp.arch for a while. My opinion is that it is an interesting take on older ideas and will be an interesting contender. But radical it is not! I do agree that wringing out perf. with superscalar arch is a losing cause, but you can play tricks with reg files, which is what leads to Mill-like arch.
As I keep saying, there are no new ideas in computing, only new implementations.