Phoenix - A prominent Stanford researcher has designed a novel microprocessor he claims could power a range of mighty yet less costly scientific computers. Whether the Merrimac CPU, unveiled at the recent Supercomputing 2003 conference, will see the light of day is a question that goes to the heart of a brewing controversy over how the small but highly strategic supercomputing sector will deliver next-generation petascale systems.
Merrimac flies in the face of the current practice of building supercomputers from thousands of low-cost, off-the-shelf CPUs, often X86 parts. Such systems dominate the latest list of the world's top 500 supercomputers (see story, page 20), but fall short of the needs of the most demanding scientific and government applications, some say. Intel Corp. announced last week that it is plowing $36 million into a three-year program to push such cluster systems to the next level.
In a paper presented at SC2003 here, William Dally, a professor of computer science at Stanford University, said today's CPUs are inefficient because they spend too little time performing calculations and too much time waiting for memory. Merrimac attempts to shift microprocessor design by recasting applications as so-called streams, which expose enough parallelism to keep the chip's multiple arithmetic units busy and provide mechanisms to handle more calculations without going to off-chip memory.
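The memory argument can be illustrated with a toy model. The Python sketch below is not Merrimac code and does not reflect Dally's actual programming model; the `OffChipMemory` class and both functions are hypothetical names invented here. It only counts simulated off-chip accesses for the same calculation done two ways: as three separate passes that write intermediate arrays back to memory, and as a single streamed pass that keeps intermediates in local values, standing in for on-chip registers.

```python
# Illustrative sketch only: counts simulated off-chip memory accesses for two
# ways of computing y[i] = (a[i] * b[i] + c[i]) ** 2. All names are hypothetical.


class OffChipMemory:
    """Toy memory model that counts word-sized loads and stores."""

    def __init__(self):
        self.accesses = 0

    def load(self, array, i):
        self.accesses += 1
        return array[i]

    def store(self, array, i, value):
        self.accesses += 1
        array[i] = value


def conventional(mem, a, b, c):
    """Three separate passes; each intermediate array round-trips through memory."""
    n = len(a)
    t1, t2, y = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):  # pass 1: multiply
        mem.store(t1, i, mem.load(a, i) * mem.load(b, i))
    for i in range(n):  # pass 2: add
        mem.store(t2, i, mem.load(t1, i) + mem.load(c, i))
    for i in range(n):  # pass 3: square
        v = mem.load(t2, i)
        mem.store(y, i, v * v)
    return y


def streamed(mem, a, b, c):
    """One pass; intermediates stay in local variables (stand-ins for registers)."""
    n = len(a)
    y = [0.0] * n
    for i in range(n):
        t = mem.load(a, i) * mem.load(b, i) + mem.load(c, i)  # never stored off chip
        mem.store(y, i, t * t)
    return y


if __name__ == "__main__":
    n = 1000
    a = [float(i) for i in range(n)]
    b = [2.0] * n
    c = [1.0] * n
    m_conv, m_stream = OffChipMemory(), OffChipMemory()
    assert conventional(m_conv, a, b, c) == streamed(m_stream, a, b, c)
    print("conventional accesses per element:", m_conv.accesses / n)
    print("streamed accesses per element:", m_stream.accesses / n)
```

In this toy case the three-pass version makes eight memory accesses per element and the streamed version four, for the same three arithmetic operations. The real design's register hierarchy is far more elaborate; the sketch shows only why keeping intermediates on chip raises the ratio of arithmetic to memory traffic.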
"There are many problems with today's off-the-shelf processors, including the fact they squander the bandwidth," Dally said in a panel here on petaflops architectures. "Petaflops don't matter. Bandwidth, not flops, is the issue."
The Merrimac design contains a cluster of sixty-four 64-bit floating-point multiply-add units fed from a hierarchy of registers and supervised by separate on-chip controllers. Dally estimates a 90-nanometer chip measuring 10 x 11 mm could deliver 128 Gflops, yet would cost only $200 to make and would dissipate about 31 watts. A 96-port router chip of a similar size would connect up to 16 Merrimac nodes on a single board or 512 nodes in a cabinet.
The resulting system architecture could deliver a 2-Tflops workstation for $20,000 or a 2-Pflops supercomputer for $20 million, according to Dally's paper.
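The headline numbers can be cross-checked with simple arithmetic. The Python sketch below assumes a 1-GHz clock and counts each multiply-add as two floating-point operations; the article states neither assumption, but together they reproduce the quoted 128 Gflops. It also treats "2 Pflops" as a decimal 2 x 10^15 flops. The constant and function names are illustrative.

```python
# Back-of-the-envelope check of the Merrimac figures quoted above.
# ASSUMPTIONS (not stated in the article): 1-GHz clock, and one multiply-add
# counted as two floating-point operations.

FPUS_PER_CHIP = 64
FLOPS_PER_FPU_PER_CYCLE = 2      # assumption: multiply + add
CLOCK_HZ = 10**9                 # assumption: 1 GHz
CHIP_WATTS = 31
NODES_PER_BOARD = 16
NODES_PER_CABINET = 512


def chip_peak_flops():
    """Peak floating-point rate of one chip, in flops."""
    return FPUS_PER_CHIP * FLOPS_PER_FPU_PER_CYCLE * CLOCK_HZ


def ceil_div(a, b):
    """Integer ceiling division, exact for large integers."""
    return -(-a // b)


def nodes_for(target_flops):
    """Number of chips needed to reach a target peak rate."""
    return ceil_div(target_flops, chip_peak_flops())


if __name__ == "__main__":
    peak = chip_peak_flops()
    print("chip peak (Gflops):", peak / 1e9)
    print("one 16-node board (Tflops):", NODES_PER_BOARD * peak / 1e12)
    nodes = nodes_for(2 * 10**15)
    print("nodes for 2 Pflops:", nodes)
    print("cabinets for 2 Pflops:", ceil_div(nodes, NODES_PER_CABINET))
    print("workstation price per node ($):", 20_000 / NODES_PER_BOARD)
    print("supercomputer price per node ($):", 20_000_000 / nodes)
    print("Gflops per watt:", round(peak / 1e9 / CHIP_WATTS, 2))
```

Under those assumptions a single 16-node board comes to 2.048 Tflops, matching the workstation claim, and a 2-Pflops machine needs 15,625 nodes, or 31 cabinets. The two price points work out to $1,250 and $1,280 per node, so the quoted system prices are consistent with each other. Both are well above the $200 chip cost because a node includes more than the processor.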
His Stanford team has hand-coded three streaming applications to demonstrate a simulation of the processor. However, the group has yet to design a compiler for the system and is still tinkering with the organization of its register files to optimize performance.
So far, Dally, who helped design the Cray T3D and T3E systems, has not found anyone among the top five computer makers he has visited willing to build the chip, which is aimed at applications with lots of data parallelism. "Most of these companies want to build database servers and just sell them as scientific machines," he said. But in Dally's view, "It's silly not to build a custom processor for supercomputing."
The SC2003 panel at which Dally spoke included executives from Cray Inc. and Sandia National Labs. The two organizations are currently building the so-called Red Storm system from thousands of Advanced Micro Devices Opteron processors at an estimated cost of $100 million.
"Red Storm is a custom system, but they won't spend the extra $10 [million] to $12 million to build a custom processor," Dally said in a conversation after the panel. "This conservative approach is actually more risky than building custom processors, because they take the really expensive thing-bandwidth-and they squander it."Too small for custom
There are plenty of reasons for conservatism in today's high-performance computer market. "The whole technical-computing market is too small even for IBM to develop a [fully] custom system for it," said Earl C. Joseph, a vice president of research for International Data Corp. (Framingham, Mass.).
The sector, which comprises some 60,000 systems, slipped 7.2 percent to $4.7 billion in 2002, its second annual decline after 15 years of steady growth. IDC blames the downturn and the rise of clusters for the slide. Of that total, about 250 systems were supercomputers worth $1 million or more each, for a market totaling $1 billion a year and generally holding steady.
Other economic factors are tightening the screws on R&D. "For over 20 years, the venture capital community invested $2 billion to $3 billion in 15 to 20 companies," said Steven J. Wallach, vice president of technology of router maker Chiaro Networks (Richardson, Texas) and a founder of one such startup, Convex Computer, which was acquired by Hewlett-Packard Co. "Over the last five years, that VC investment hasn't happened, and it won't happen going forward. What's more, we have had a consolidation in microprocessor architectures. These two things have had a big impact on supercomputing."
In a town hall meeting here, a panel of experts said they will attempt to address the perceived lack of adequate incentives to drive supercomputing innovation. Members of the Committee on the Future of Supercomputing, commissioned by the National Research Council, will file a report at the end of next year.
"The recommendations will be about the ways government spends money on supercomputers and the levels of its spending," said Marc Snir, co-chair of the committee and head of the computer science department at the University of Illinois at Urbana- Champaign.
A separate interagency report already filed to White House budget makers recommends the United States more than double its current spending on supercomputers. However, people associated with the report expressed doubts about how it will be received in the current budget climate (visit www.eetimes.com/story/OEG20030818S0012).
Nevertheless, Cray, IBM Corp. and Sun Microsystems Inc. are gearing up separate projects to deliver novel petascale computers by 2010, buoyed by about $50 million each in grants from the government's High Productivity Computing Systems (HPCS) project. All three are keeping mum on the details of their plans, but many involve new system-on-chip, interconnect and software architectures (see Aug. 18, page 1).
Cray has recruited Dally as an adviser for its HPCS design, called Cascade, although the company will not use Merrimac. "We think streaming will be very important, but we will do it in a different way," said Burton Smith, Cray's chief scientist. The tension between off-the-shelf clusters and specialty systems like Cascade, he said, is "going to increasingly differentiate the two kinds of supercomputers out there, and that's wonderful for us."
IBM's system aims to be compatible with the PowerPC and to adapt to different technical and business workloads, said Mootaz Elnozahy, who runs the company's HPCS program. "We will have a new microarchitecture. It's not a conventional core. The notion of a core [in this time frame] becomes fuzzy."
Elnozahy said the IBM team is "pushing some aggressive ideas, but it's tough. I wake up every day needing to justify continuing this R&D program. Meanwhile, what I am scared of is that someday we are going to be able to configure a supercomputer at the Dell Web site."
Intel's new Advanced Computing Program is hoping to drive the technology in just that direction. Details of the program, seen as a reaction to the HPCS efforts, are still scarce. But Rick Herrmann, a marketing manager in Intel's high-performance computing office, said it will focus on a variety of systems and software projects, including programming models for multithreaded CPUs, power and thermal packaging for small-form-factor systems, and interconnects such as Infiniband.
"We want to accelerate the use of volume technologies in high-performance computing," Herrmann said.
Others hope to take a stepwise approach to petascale systems by evolving their current architectures.
Jim Tomkins, Red Storm project leader at Sandia, said that a petaflops system could be designed by 2010 using 25,000 processors built at 45-nm design rules. By that time, the chips should have four symmetric-multiprocessing CPUs on a die running at 10 GHz and delivering 40 Gflops per chip. "I think we can build a petaflops system with similar level of complexity to systems today, but using four or five times as many processors," Tomkins said.
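Tomkins' projection is easy to check. The short Python sketch below assumes that each of the 25,000 "processors" is one four-CPU chip delivering the quoted 40 Gflops; that reading is an interpretation of the quote, and the function names are illustrative.

```python
# Arithmetic check of the Sandia projection quoted above.
# ASSUMPTION: "25,000 processors" means 25,000 chips at 40 Gflops each.

PROCESSORS = 25_000
GFLOPS_PER_CHIP = 40
CPUS_PER_DIE = 4
CLOCK_GHZ = 10


def system_pflops():
    """Aggregate peak rate in Pflops (1 Pflops = 1,000,000 Gflops)."""
    return PROCESSORS * GFLOPS_PER_CHIP / 1_000_000


def flops_per_cycle_per_cpu():
    """Floating-point operations each CPU must retire per clock cycle."""
    return GFLOPS_PER_CHIP / (CPUS_PER_DIE * CLOCK_GHZ)


if __name__ == "__main__":
    print("system peak (Pflops):", system_pflops())
    print("flops per cycle per CPU:", flops_per_cycle_per_cpu())
```

On that reading the system reaches exactly 1 Pflops, with each of the four CPUs on a die retiring one floating-point operation per cycle at 10 GHz.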
Similarly, Fujitsu Labs may take an incremental approach. Kenichi Miura, a Fujitsu fellow, sketched out the possibility of pushing the company's Sparc processors to 4 GHz with two cores on a die and as many as 256 nodes, each equipped with 128 processors, on an optical net to get to the petascale milestone.
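Miura's sketch gives no per-processor performance figure, but the implied requirement can be worked backward. The Python sketch below assumes that each "processor" is one two-core die and that the target is exactly 1 Pflops of peak performance; both are assumptions made here for illustration, not figures from Fujitsu.

```python
# Works backward from the node counts quoted above to the per-core rate needed.
# ASSUMPTIONS: one "processor" = one two-core die; target = 1 Pflops peak.

NODES = 256
PROCESSORS_PER_NODE = 128
CORES_PER_PROCESSOR = 2      # assumption: "two cores on a die" per processor
CLOCK_GHZ = 4
TARGET_GFLOPS = 1_000_000    # assumption: 1 Pflops peak


def total_processors():
    return NODES * PROCESSORS_PER_NODE


def required_gflops_per_core():
    """Peak Gflops each core must deliver to reach the target."""
    return TARGET_GFLOPS / (total_processors() * CORES_PER_PROCESSOR)


def required_flops_per_cycle():
    """Floating-point operations per clock cycle each core must sustain."""
    return required_gflops_per_core() / CLOCK_GHZ


if __name__ == "__main__":
    print("processors:", total_processors())
    print("Gflops per core needed:", round(required_gflops_per_core(), 2))
    print("flops per cycle per core needed:", round(required_flops_per_cycle(), 2))
```

The configuration totals 32,768 processors. Under the stated assumptions, each core would need to deliver roughly 15 Gflops, or just under four floating-point operations per cycle at 4 GHz.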
See related chart
The Merrimac CPU designed at Stanford includes sixty-four 64-bit floating-point units organized in 16 clusters of four each, fed by a hierarchy of register files. A 90-nm chip is said to deliver 128 Gflops at a manufacturing cost of $200 using novel 'streaming' code.