As the trend toward putting multiple processors on a chip spills into the commercial marketplace, thought leaders say it's time for an even more radical shift, to new software models that reveal, rather than hide, the details of all the hardware resources engineers can now cram into a microprocessor.
Academics rally loudly around this trend, and network processor designers represent a phalanx of revolutionary engineers who have heard the call. But many top commercial microprocessor engineers still resist the painful migration away from today's legacy software aimed at uniprocessor machines.
Meanwhile, power dissipation and memory bandwidth are raising concerns about storms on the horizon.
"The big change that will happen is that the distributed nature of [on-chip] hardware resources will have to be exposed to the software and the compiler," said Anant Agarwal, professor of electrical engineering and computer science at the Massachusetts Institute of Technology (Cambridge, Mass.). "As clock cycles get faster, you can no longer reach all the crevices of a chip in a single cycle, and it is incredibly difficult to hide that latency from the software."
Rather than hide that complexity as current processors do, Agarwal says, designers need to expose it. MIT's Reconfigurable Architecture Workstation (RAW) processor distributes a number of small and simple CPUs, each with its own memory, across a chip. A new compiler and programming language in development at MIT would help applications exploit all that parallelism, even allowing apps to reconfigure the wires networking the on-chip CPUs.
The RAW chip should be back from the fab early in 2002. But the compiler and language may take two years to be ready for prime time. Performance gains could range from as little as 10 percent for existing apps to sixteenfold for applications written and compiled with the new tools.
Bill Dally, professor of electrical engineering and computer science at Stanford University (Palo Alto, Calif.), generally agrees with this approach and has helped prototype a similar vision of the future MPU in Stanford's Smart Memories project.
"I personally feel the industry is poised for radical change for how people do microprocessor design. Getting to multithreaded software is the biggest problem, and a more huge task than any of the hardware issues," Dally said.
In Dally's view, tomorrow's software model will be based on a new kind of packed-data type called streams, similar in form to a series of data records. Kernel units on multiprocessing chips will perform complex actions on these streams. That will pull much more of today's computing work onto a die, where it can be handled quickly. But it will require a new approach to applications and systems software.
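As a rough illustration of the stream idea, the sketch below treats a stream as a sequence of data records and a kernel as a function that consumes one stream and produces another. It is only a software analogy under stated assumptions, not Stanford's design: the names Record, scale_kernel, threshold_kernel and run_pipeline are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Record:
    """One element of a stream (hypothetical layout, for illustration only)."""
    key: int
    value: float


# A kernel takes a stream of records and yields a new stream of records.
Kernel = Callable[[Iterable[Record]], Iterator[Record]]


def scale_kernel(factor: float) -> Kernel:
    """Build a kernel that multiplies every record's value by `factor`."""
    def kernel(stream: Iterable[Record]) -> Iterator[Record]:
        for rec in stream:
            yield Record(rec.key, rec.value * factor)
    return kernel


def threshold_kernel(minimum: float) -> Kernel:
    """Build a kernel that keeps only records whose value is >= `minimum`."""
    def kernel(stream: Iterable[Record]) -> Iterator[Record]:
        for rec in stream:
            if rec.value >= minimum:
                yield rec
    return kernel


def run_pipeline(stream: Iterable[Record], kernels: List[Kernel]) -> List[Record]:
    """Chain kernels so each record flows through every stage in order."""
    for kernel in kernels:
        stream = kernel(stream)
    return list(stream)


source = [Record(i, float(i)) for i in range(5)]
result = run_pipeline(source, [scale_kernel(2.0), threshold_kernel(4.0)])
```

Because each kernel touches only the record in front of it, the stages could in principle be placed on separate on-chip units and fed record by record, which is the kind of locality the stream model is meant to expose.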
This sort of talk is anathema to engineers in top computer companies such as Intel and Sun. Both companies are preparing processors that use multiple CPU cores on a die and simultaneous multithreading technology to allow each on-chip processor to handle multiple application threads. However, both companies will ensure that those multiprocessors maintain compatibility with their existing applications.
"The installed software base is very difficult to change," said John Shen, director of Intel Corp.'s Microarchitecture Research Labs (Santa Clara, Calif.).
At least one network processor maker shares that view. Broadcom's SiByte division (San Jose, Calif.) chose a dual-processing MIPS architecture to tap an existing base of MIPS code at communications OEMs. "The customer doesn't want to program to a proprietary architecture. If you go that route you will have to provide a total solution with the software," said Dan Dobberpuhl, general manager of Broadcom's broadband processor group.
"The only way these ideas may take hold," concedes Stanford's Dally, "is in an incremental fashion, like Intel used with MMX [the X86 multimedia extension], where the new chip runs all the old Windows programs, but if you have the streaming-enabled version of the software you get this big boost."
Michael Splain, CTO of Sun Microsystems Inc.'s processor group (Mountain View, Calif.), calls preservation of compatibility a must for his team. However, Splain shares the view that the future will be shaped more by innovation in software than in CPUs, though for reasons of his own. Microprocessor engineers "had our day in the spotlight in the '80s and '90s. Now it's the time for the software engineer," Splain said. A CPU engineer "can take five years to design something that lives for two years in the marketplace. The software people are out ahead of everything, and processors just need to stay in the middle of the road. If you get too highly optimized for any one set of applications, the software will just run all over you. I think good processor designs will be somewhat conservative."
While computer makers like Sun and Intel are taking conservative steps forward, many network processor engineers are going boldly where no chip maker has gone before. Intel's next-generation IXP1200 NPU is said to have as many as 16 CPU cores on the die. That pales in comparison with startup EZChip, which packs 64 cores on a die. And many other startups, including Cognigine, are trying out additional innovative architectures such as variations of VLIW.
"Innovation is always easier when you don't have legacy software. The other thing you need is lots of investment, and the network processor space has attracted lots of that," said Linley Gwennap, a market watcher with The Linley Group (Mountain View) who focuses on NPUs.
Indeed, Gwennap tracks about 30 NPU startups today, each with about $10 million to $20 million in venture capital backing and some with considerably more than that. "I wouldn't be surprised if VCs have pumped half a billion dollars into this segment," Gwennap said.
Mario Nemirovsky provides a living example of that trend. Though his first NPU startup has yet to ship its first product, Nemirovsky is already at work securing financing for his second company.
"I realized you can do a lot more architectural improvement in networking than you can in high-end computing," he said. "The needs of networking systems are still changing rapidly. And even the OEMs don't have clarity on what they need. So more platforms will still emerge in this network-processing area."
While a researcher at the University of California, Santa Barbara, Nemirovsky pioneered the technique of simultaneous multithreading (SMT), which allows one processor to handle multiple pieces of an application, or threads, at a time, in effect making one processor act like a multiprocessor.
Among the companies planning to use SMT are Intel, Sun and Nemirovsky's first NPU startup, Clearwater Networks Inc. (Los Gatos, Calif.). Broadcom is also evaluating the approach for its SiByte architecture.
However, Nemirovsky left Clearwater in May 2001 after a disagreement with management, before the company shipped an eight-way SMT chip aimed at control-plane processing. His current startup, FlowStorm, is defining an architecture aimed at data path processing.
"The NPU area is very dynamic and diverse, but I am not sure they are coming up with better architectures," said Intel's Shen.
Some microprocessor designers foresee problems they consider more fundamental than the move to multicore or communications-centric architectures.
"The biggest problem is power dissipation," said Dobberpuhl of Broadcom. "It's a fundamental problem for all designers, and it's only exacerbated by scaling. Without a breakthrough, this will become the limiting factor. All the other problems are more tractable. This is the fundamental one. It's bigger than lithography, and everybody is working on it."
Indeed, skyrocketing power requirements stand in the way of Intel's traditional push toward ever-greater clock speeds, which will continue to be a hallmark of its performance road map despite industry criticism of the approach, said Shen.
"Intel has been pushing hard on frequency, and that has served us well. Historically people have foreseen walls, but every time we get closer to them the walls move. I don't see a wall out there for how fast we can push frequency today as long as we can solve the power problem," Shen said.
Intel is working multiple levers to solve the power problem at the transistor, logic, circuit and microarchitectural levels. For example, today's P4 uses aggressive clock gating, shutting off the clock in parts of the die that are not in use. And the IA-64 transfers the complexity of optimizing code to a software compiler, which shaves off some power drain.
"My personal view is that memory is the predominant performance bottleneck," Shen said. "CPU speed increases 40 to 50 percent a year. However, memory speed increases at a paltry 5 percent a year. That gap will continue to widen. Today it takes 100 to 150 clock cycles to access main memory for 1- to 2-GHz CPUs. That could expand to several hundred clock cycles in the foreseeable future."
Stanford's Dally agreed. "Too many people think in terms of counting the number of arithmetic operations on a processor, but that's easy. The real problem is bandwidth. You need to stage data explicitly between hierarchies of memory," Dally said.
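One way to picture what Dally means by staging data explicitly is a loop that copies one block at a time from a large, slow store into a small, fast buffer and does all its work there. The sketch below is a software analogy only; the function name staged_sum_of_squares and the block size are hypothetical, and a Python list stands in for both levels of memory.

```python
from typing import Sequence


def staged_sum_of_squares(main_memory: Sequence[int], block_size: int) -> int:
    """Sum the squares of all values, working on one staged block at a time."""
    total = 0
    for start in range(0, len(main_memory), block_size):
        # Explicitly stage one block into a small local buffer, standing in
        # for a copy from main memory into fast on-chip memory.
        local_buffer = list(main_memory[start:start + block_size])
        # All arithmetic touches only the staged block.
        total += sum(x * x for x in local_buffer)
    return total


answer = staged_sum_of_squares(list(range(10)), block_size=4)
```

The arithmetic here is trivial, which is Dally's point: the design effort goes into deciding when each block moves and how large it can be.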