Recently I posted a Hackaday.com interview of Ivan Godard, CTO of Out-of-the-Box Computing, a startup company creating the Mill CPU. That interview, and the ongoing series of talks about the architecture, is raising interest and comment, not only at Hackaday and here on EE Times but also on Reddit, YouTube, and elsewhere. "What makes this stuff worth some attention: this guy and his company are executing a cool idea that has been around since the 1970s: using stack machines as a basic abstract model instead of Turing/register machines," says Raphael Poss, of the University of Amsterdam's Computer Systems Architecture -- Institute for Informatics. Mr. Godard has agreed to respond here to some of these questions.
— Caleb Kraft, Chief Community Editor
By far the most common question about the Mill CPU architecture has been "When can I get it?" Sadly, the answer is "It will be a while yet." Heavy semiconductor (which a CPU chip most certainly is) is like steel mills and oil refineries -- it takes a long time to get a new CPU architecture from concept to production. Out-of-the-Box Computing (OOTBC) has been working on the Mill for a decade. The design is done and patent applications are filed -- that's why we can talk about it publicly -- but now we must go beyond our software simulations to produce a proof-of-principle FPGA implementation, and then go through fab iterations before we have something you can stick in your pocket or on a circuit board. That's some years more -- please be patient.
Another common question is "What's the market?" This is most easily answered by saying what is not the market. First, the Mill is inherently a 64-bit chip; a 32-bit or smaller Mill is architecturally impossible, although one could design a 32-bit chip that was clearly a Mill relative. Consequently the low end of the embedded microprocessor market -- the 8- and 16-bit chips that go into traffic lights and thermostats -- is not a Mill market. Second, the Mill is designed for per-thread performance at low power. You can have a multicore Mill -- in fact the small size and low power of Mill implementations mean you can put more cores on a chip at a given price and power point than contemporary superscalar chips can -- but the Mill is not targeted at GPU-style data pipe computation.
Beyond those excluded markets, the Mill is truly general-purpose: it is technically apt for cellphones, laptops, tablets, desktops, workstations, servers, supercomputers, middle- and high-end embedded, and most anything else you have in mind. Note: technically apt. OOTBC is a commercial enterprise, and we are not about to repeat the mistake of prior CPU startups like Transmeta and Montalvo in diving head-first into markets with behemoth incumbents; we will target niche markets. Which niches? Market niches come and go, so we will pick one when we approach having a product. The choices may be determined by future business partners; we're actively looking for partners with Mill-shaped problems.
So much for the business questions. Many of the technical questions, from EE Times and elsewhere, are too detailed to answer here, but if you really want to know how it works, I recommend our talks, available at http://ootbcomp.com/docs, and the comp.arch newsgroup, which has been full of Mill-related discussions for a year now.
One question popping up frequently is whether the Mill is a new CPU category, or just a variation on an old idea such as stack machines. In one sense, all CPU architectures are variations on the work of Alan Turing; in another sense every chip SKU is a new and different design. The Mill uses a new way to communicate data between operations, called the belt, and the belt is certainly reminiscent of a stack machine in some ways: implicit destination of results, for example. In other ways, such as non-destructive operand access, a belt does not follow the stack model at all.
The difference is akin to the difference between a three-address register machine and an accumulator machine -- related, but at heart a different category. We view belt machines as a new category of architectures, at rank equivalent to transport-triggered, or superscalar, or dataflow, or accumulator architectures. The Mill is the first belt machine, but it won't be the last; there's plenty of room in the design space for work by other people.
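The contrast drawn above between a belt and a stack can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the fixed belt length, the addressing of operands by position, and every name here (Belt, drop, read, add, stack_add) are hypothetical illustrations, not the Mill's actual design. It shows just the two properties mentioned: results land at an implicit destination, and reading an operand does not consume it.

```python
from collections import deque


class Belt:
    """Toy belt: a fixed-length queue of recent results.

    Hypothetical model for illustration only; the length and the
    method names are assumptions, not the Mill's actual design.
    """

    def __init__(self, length=8):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._slots = deque(maxlen=length)

    def drop(self, value):
        """Place a new result at the front (position 0)."""
        self._slots.appendleft(value)

    def read(self, pos):
        """Read an operand by position without removing it."""
        return self._slots[pos]

    def add(self, a, b):
        """Add two belt positions; the result has no named destination."""
        self.drop(self.read(a) + self.read(b))

    def contents(self):
        """Return the belt as a list, newest result first."""
        return list(self._slots)


def stack_add(stack):
    """Stack-machine add: both operands are popped, so they are consumed."""
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)


belt = Belt(length=4)
belt.drop(3)
belt.drop(4)
belt.add(0, 1)    # reads 4 and 3; both stay on the belt
belt.add(0, 2)    # 7 + 3: the 3 is still available as an operand

stack = [3, 4]
stack_add(stack)  # 3 and 4 are gone; only the sum remains
```

After these steps the toy belt holds 10, 7, 4, 3 (newest first), while the stack holds only 7. Dropping one more value onto the full belt pushes the oldest entry, the 3, off the far end.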
A similar question is whether the Mill is a VLIW. The Mill is wide-issue like a VLIW; one instruction can have 30 or more individual MIMD operations depending on the Mill family member involved. Yet there are differences -- no conventional VLIW can have such wide instructions -- and once past decode the Mill departs from VLIW designs. Perhaps it is best to say that a Mill is a belt machine with VLIW-like instruction encoding.
Other comments, including a lengthy one on this site, suggest that the Mill design is too timid: it has not gone far enough from conventional CPUs. I agree with the comment that a re-examination of dataflow architectures might be promising, and deeply sympathize with researchers who want to probe beyond mere accounting programs. However, such comments address design in the abstract, not design that is to give real benefit to real people within real lifetimes by real enterprises. Were I an academic I would not have designed the Mill; I would have designed something that could be completed in a doctoral thesis. I wholeheartedly support computer science research; but the Mill is not a research project.
Some comments suggest that the first step should be to design a new programming language; you might be surprised at how many garage language designers there are in the world, or at least how many think that a CPU should be designed to fit their language. Here in the real world we did not have that luxury: at the very beginning we decided that any CPU product hoping for traction in the real world would have to run existing programs written in existing languages, unchanged, straight off the web with no rewrite required. The Mill has a new instruction set architecture, so the code must be recompiled (or binary translated; those tools are getting better every day), and code that contains inline assembly or other machine dependence will naturally take more work to port. But recompile your code and run it and the only differences you will see on a Mill are higher speed, lower power, and an immunity to many exploits and bugs.
Speaking of speed and power, in our talks I have suggested that a Mill has a 10x or better architectural advantage over the usual out-of-order superscalar, at equivalent clock and process. Because the Mill is a compatible family of CPUs with widely differing parameters, the Mill's advantage may be tilted mostly to performance, or power, or manufacturing cost, or a mix, depending on the Mill family member configuration. Many commenters balk at this and demand benchmarks, justly, and unfortunately we currently have only a few very small benchmarks. The recent change in patent laws forced us to set aside work on the tool chain and put all of our limited resources into patents -- expected to be over fifty when issued. In some ways the delay has actually helped us -- we were converting our tool chain to use LLVM, and in the meantime while we have been writing patent applications, a few other companies have made the LLVM internal model look a little less specific to an x86. Still, we are very impatient to resume that work.
So, no big benchmarks; certainly nothing that needs an operating system underneath it -- we also had to put off starting the Linux port. Eventually, sure, but not now. Regardless, there are good grounds to believe the 10x, even without benchmarks. That's because our 10x is already displayed by other wide-issue CPUs; CPUs that have been in use for decades; CPUs that are busily running your car and TV today. Those CPUs are DSPs.
If you compare the power consumption and performance numbers of a DSP against a general-purpose superscalar architecture (such as Intel's Haswell, for one example among many), you will find that the DSP beats the superscalar by 10x easily, and usually by much more. Moreover, DSPs have been beating general-purpose machines for many years.
Fundamentally, the Mill internally works like a DSP. The encoding is VLIW-like, as is the encoding of most DSPs; the belt is, in engineering terms, merely an exposure of the bypass network that DSPs use. And the Mill, like a DSP, lacks the incredibly expensive rename registers, instruction and reorder buffers, scheduler stages, and all the other complex machinery required to run an out-of-order CPU.
Is then the Mill a DSP? Well, a Mill would make a pretty good DSP, but at heart no, it's not. The Mill has the performance and power of a DSP, but runs general-purpose code, while conventional DSPs choke on the stuff. The Mill inherits a long-proven 10x, while taking it out into the general-purpose world.
For announcements of talks, white papers, videos, and so on about the Mill, you can sign up on our low-volume email list.