Not everyone working on CPU architecture is with one of the big manufacturers.
We typically only get news about the major players in the CPU game, so it is refreshing to hear about a small group, Out-of-the-Box Computing, that hopes to make big changes in CPU architecture. "The Mill" is their name for the processor. In this interview conducted by Hackaday, Ivan Godard explains what the Mill is and how it is different.
In this video, Ivan covers the basics. He discusses the history of Out-of-the-Box Computing and the ideas and inspiration that formed the Mill CPU. Citing the stagnation in processor advancement after the RISC vs. CISC wars, he says that his group just knew they could "do better." They identified a huge gap between the price points of the embedded world and the desktop-computer world and thought that if they could just fill that gap, they could have something really big.
When asked if their intent was simply to produce a core, or to go all the way to producing silicon themselves, Ivan responded with this:
"Intel's quarterly dividend is bigger than ARM's annual sales. Consequently yes, we would like to be a chip company. The fallback option, of course, is that we can be an IP house."
Another quote that really stood out is:
"We really are a great supercomputer chip. Nobody makes any money at it, but they'll do anything to get more -- and we're more."
In Part 2, we get to learn a bit about the internals of the Mill CPU. Ivan compares it to a standard DSP, but points out that the Mill's advantage is that it uses roughly 10% of the power of other chips for the same computations. It does this by completely rethinking how instruction sets are handled, a topic he covers in depth in this video. He points out that even though many people may not need the higher computational power, or even the lower power usage, of the Mill, its more efficient use of space will allow for higher yields in manufacturing.
While discussing the issue of expanded memory bandwidth, Ivan pulled out this gem, which is particularly amusing:
Some years back, we took a proposal to Lawrence Livermore and they said, "Can this be built?" We had to go to a partner (it was LSI Logic at the time) and say, "Can you build us a chip with 2,700 pins?" And they swallowed real hard and said yes.
He went on to note, "The yield will be horrible, it will be incredibly expensive, and some people will want it." He actually discussed the memory interface in depth in this video, showing that, in fact, they need less memory access, not more. The rough estimate is 25% less memory access than other chips.
On the topic of the difficulties they are facing, he explains that money is a huge hurdle. They have to replace their tool chain, port an operating system, and file more than 50 patents.
An FPGA-based reconfigurable computing engine has the potential to be a superb high-performance supercomputer. Unfortunately, FPGA tools are not up to the task as discussed in this 2007 article. It has to be as easy to design parallel hardware data paths as it is to write code for general-purpose CPUs, and that's not the case with current FPGA design languages and tools. FPGA tool research has always been stymied by the fact that no major FPGA manufacturer publishes their internal architecture so that the research community can develop efficient design tools for reconfigurable computing. It would be like Intel refusing to publish the X86 instruction set and requiring everyone to program in PL/M using a compiler provided by Intel. I believe this is the primary reason CPU makers sell billions and FPGA makers have stayed small. JMO/YMMV
Peter Kogge has an interesting article called Next-Generation Supercomputing (IEEE Spectrum, January 2011). In it he states that the bottleneck with next-generation supercomputing is not the speed of floating-point processors. The problem is that the power needed to transfer data to and from those processors is much higher than the power used by the processors themselves. So a conventional computer memory hierarchy with caches and main memory becomes impractical.
A possible solution? How about FPGAs as I mentioned above -- you arrange the FPGA logic implementing your problem so that each result is pumped to adjacent or at least nearby processing elements, not bothering with register files and caches. However, it's not practical to do this because of... FPGA tools, as I just described. JMO/YMMV
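To make the "pump each result to a neighbor" idea concrete, here is a minimal software sketch of a linear chain of processing elements computing an FIR filter. It is only an illustration under my own assumptions: the name systolic_fir, the choice of an FIR filter, and the layout (each input sample broadcast to all elements, partial sums handed to the adjacent element) are hypothetical and not taken from any FPGA tool or from the article.

```python
def systolic_fir(weights, samples):
    """Simulate a chain of processing elements (PEs) computing an FIR filter.

    Hypothetical layout: PE k holds weights[k] and a single partial-sum
    register. On every clock tick the current sample is broadcast to all
    PEs, and each PE adds its product to the partial sum handed over by
    its right-hand neighbor. There is no shared register file or cache.
    """
    n = len(weights)
    partial = [0] * n  # one local register per PE
    outputs = []
    for x in samples:
        # All PEs update on the same tick, so compute the next state
        # from the old register values before overwriting them.
        nxt = [0] * n
        for k in range(n):
            from_neighbor = partial[k + 1] if k + 1 < n else 0
            nxt[k] = from_neighbor + weights[k] * x
        partial = nxt
        outputs.append(partial[0])  # PE 0 emits one finished result per tick
    return outputs


if __name__ == "__main__":
    # An impulse input reads the weights back out, one per tick.
    print(systolic_fir([1, 2, 3], [1, 0, 0, 0]))
```

Each element only ever talks to the element next to it, which is the property that keeps data movement (and so power) down. Writing this in Python takes a few lines; expressing the same structure in an FPGA design flow is where the tooling problem shows up.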
To create parallel multicore systems, many FPGA tools fall short because they are design assembly and implementation infrastructure, lacking in analysis. At Space Codesign, one of the ways our SpaceStudio ESL hardware/software codesign tool can be used is as a design creation front end for FPGA tool infrastructures like Xilinx Vivado (and likely others). We published a position paper on this topic on this site a few weeks ago ...
The key to supercomputer performance is that your architecture is optimized for an application, or family of applications. Knowing the internal details of a processor core or FPGA device is one thing (there are architecture diagrams available, after all!), but it is the system-level performance that comes into play at the end of the day.