SAN MATEO, Calif. - Sun Microsystems Inc. and IBM Corp. last week disclosed plans to take high-end microprocessor design to the next level, separately describing new generations of multicore, multithreaded CPUs.
In a rare look at its road map, Sun tipped Niagara, a design set to debut in 2005 that effectively puts a 32-processor Sparc server onto a single chip. IBM gave a briefer glimpse ahead, revealing that its Power5 processor, set to ship next year, could be the first server CPU to sport four threads, in the form of two dual-threaded cores.
Sun's Niagara is expected to use eight highly streamlined Ultrasparc IIi cores, each running four threads, on a 340-mm² die. It will also integrate a memory controller, multiple Gigabit Ethernet media-access controllers and hardware acceleration for triple-DES and RC4 security. Sun intends to use Niagara as the CPU in its 2005-class uniprocessor server blade designs and as a network processor in other systems.
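The thread counts quoted for the two chips follow from simple multiplication: cores per chip times hardware threads per core. A minimal sketch of that arithmetic, using only the figures reported above (the function name `logical_cpus` is a hypothetical helper for illustration, not anything from Sun or IBM):

```python
def logical_cpus(cores: int, threads_per_core: int) -> int:
    """Hardware threads on one chip, each of which software may see as a CPU."""
    return cores * threads_per_core


# Figures as reported in the article; the helper itself is illustrative.
power5 = logical_cpus(cores=2, threads_per_core=2)   # two dual-threaded cores
niagara = logical_cpus(cores=8, threads_per_core=4)  # eight cores, four threads each

print(power5, niagara)  # 4 32
```

The result for Niagara matches Sun's description of a 32-processor server on a single chip.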
IBM's Power5 sports a new CPU core with execution units redesigned for multithreading. The chip, currently made in a 130-nanometer process, will debut at clock rates faster than 1.5 GHz. IBM expects an overall fourfold performance improvement over Power4-based systems, said Mark Papermaster, director of microprocessor design for IBM's server division.
"We expect this to be very significant, especially on applications such as transaction processing that have a high degree of data dependencies," Papermaster said. "With Power5 it appears to the operating system that there are four CPUs on each chip."
The multithreading announcements raise the question of how the industry's largest microprocessor supplier, Intel Corp., will respond. The CPU giant put two threads on its 32-bit Xeon server processor last year and is designing a dual-core 64-bit Itanium for 2005. But Intel has no multicore, multithreaded CPUs on its immediate road map.
"On the server side, we should expect multicore and multithreading will be the rule in the second half of the decade," said Justin Rattner, director of Intel's microprocessor research lab.
Multithreading has generated plenty of buzz among designers for its ability to blast through the chief bottleneck in most server CPU designs: memory access. Threading masks latency by executing instructions for one process while another process is waiting for access to memory. The multicore approach, meanwhile, helps designers reconcile their burgeoning transistor budgets with their constricted design tools, potentially speeding time-to-market.
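How threading masks latency can be sketched with a simplified textbook-style model, offered here as an illustration rather than a description of any vendor's design. It assumes each thread alternates between a fixed burst of computation and a fixed memory stall, and that the core can switch to another ready thread at no cost. The function name and the cycle counts are hypothetical.

```python
def core_utilization(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles a core spends executing instructions.

    Simplified model (an assumption, not a measured result): every thread
    runs for `compute_cycles`, then waits `stall_cycles` on memory, and
    other threads fill the gap with zero switching overhead.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, stall_cycles >= 0")
    busy = threads * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, busy / period)


# Illustrative numbers only: 25 cycles of work per 75-cycle memory stall.
for n in (1, 2, 4, 8):
    print(n, core_utilization(n, compute_cycles=25, stall_cycles=75))
# 1 0.25
# 2 0.5
# 4 1.0
# 8 1.0
```

Under these assumed numbers a single thread keeps the core busy a quarter of the time, and four threads are enough to hide the stall entirely; real designs fall short of this ideal because threads contend for caches and other shared resources.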
"The research community has been convinced this is the right direction, but there's a time lag before companies have products that embody it," said David A. Patterson, professor of computer science at the University of California at Berkeley and a part-time consultant for Sun.
"The first couple of implementations we have seen from Intel and IBM use only two threads and don't adequately show the full power of this idea," he said. "But Sun's Niagara chip is really exciting and will get the attention of microprocessor designers."
Niagara is based on technology Sun acquired last July from startup Afara WebSystems Inc. (Santa Clara, Calif.). Afara was founded by Stanford EE professor Oyekunle Olukotun and Sun's former chief architect, Les Kohn, who helped design the Ultrasparc I and II.
David Yen, vice president of Sun's processor group, said the Afara technology also will appear in future Sparc processors that debut sometime after the 2005 launch of the Ultrasparc V, a nimble processor that can be software-configured to handle one or two threads. The subsequent Ultrasparc VI-class chips will also sport "hardware features for Java acceleration and extensive use of asynchronous circuitry," Yen said.
Sun's description of the Afara technology and the Sparc road map came one day after IBM disclosed in an interview with EE Times that its next-generation Power5 chip would support two cores, each running two simultaneous threads. "We have the chip back and we are in early testing of the processor. It is performing exactly as we hoped," said Papermaster.
IBM is not yet revealing the cache structure for the processor. However, Papermaster said the Power5 sports a new technique for fast data transfers between regions of main memory. Today's Power4 uses an external 32-Mbyte Level 3 cache module.
The move to multicore, multithreaded CPUs marks a shift in emphasis from the traditional focus of building ever-more-complex processor pipelines for instruction-level parallelism. Designers give three reasons for wanting to make the move.
First, running multiple programming threads simultaneously keeps a processor busy doing real work while any one thread is waiting for data from off-chip memory, something that has become a growing problem in more-conventional single-pipeline processors.
Second, designers are hitting the limits of their design tools in trying to synchronize and control signals speeding at gigahertz rates across monolithic chips spanning hundreds of millions of gates. Multicore CPUs offer them cookie-cutter blocks that can be easily replicated, limiting the length of signal paths and spreading hot spots more evenly across a die.
Finally, today's multiprocessing-enabled server software requires no changes to run on these on-chip multicore, multithreaded processors, another big plus for designers.
Question of benchmarks
While Sun gathered kudos for its Niagara concepts, analysts said it's still unclear how well those chips will perform, and which companies will lead in the new design dynamics and benchmarks for this era. "We'll have to see just how much the software can take advantage of what they are doing," said Mike Fister, general manager of Intel's server division, commenting on the IBM and Sun chips at the Intel Developer Forum last month.
"Sun is going to have a leg up on extreme threading. And the learning process with Niagara could result in design improvements in the rest of the Ultrasparc family," said Kevin Krewell, a senior analyst with The Microprocessor Report. But lacking details about the frequency, memory and I/O structures of the chip, actual performance is hard to gauge, he said. "It's one thing to have all these resources on-chip, but you have to keep them fed," said Krewell.
Indeed, the devil is in the implementation details, said Mario Nemirovsky. A pioneer in the technique of simultaneous multithreading (SMT), he is also founder of the San Jose, Calif., startup Kayamba, which is developing a network processor that can handle 256 simultaneous threads over eight cores on a die.
"The details that will kill you are how do you synchronize threads and how do you communicate between them," Nemirovsky said. "There have been many times we have thought we could get a 4x performance improvement and wound up with less than 1x."
The best designs require separate register files, execution units, instruction prefetch engines and buffers, cache lines and memory buses for each thread, he said. "As a minimum you need a register file for each thread. You also want to minimize cache thrash," Nemirovsky added. "If you share functional units you can create very difficult dependencies that make for a complex interconnect. It really becomes a kind of crossbar switch."
That's one reason why designers like Fred Weber, chief technology officer of Advanced Micro Devices Inc.'s CPU group, believe multithreading may not be worth the extra die space and design headaches. Nevertheless, AMD has already validated its upcoming 64-bit Opteron microprocessor for multicore CPUs.
To share or not to share?
Intel, for its part, worked around some of the issues by having two threads share many processor resources. "They did not do a very aggressive implementation of SMT," Nemirovsky commented. The company claims that its Hyperthreading scheme takes up about 5 percent of the die area of a Xeon or Pentium 4 processor but delivers performance improvements of up to 20 percent.
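Intel's two figures can be combined into a rough performance-per-area ratio. The sketch below takes the claimed numbers at face value and uses the best-case 20 percent gain, so the result is an upper bound on the claim, not a measurement; the function name is a hypothetical helper.

```python
def perf_per_area_gain(speedup: float, area_increase: float) -> float:
    """Ratio of performance per unit die area, after versus before a feature.

    `speedup` and `area_increase` are fractions (0.20 means 20 percent).
    """
    return (1.0 + speedup) / (1.0 + area_increase)


# Best-case claim from the article: up to 20 percent faster for about 5 percent more area.
gain = perf_per_area_gain(speedup=0.20, area_increase=0.05)
print(f"{gain:.3f}")  # 1.143
```

On those numbers, performance per unit of die area would improve by roughly 14 percent at best.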
Michael Splain, chief technology officer for Sun's processor group, said the Niagara implementation of threading will far exceed Intel's Hyperthreading. "We don't have to share resources because we have duplicate processors, memory management units and caches. We are engineering in threading from the beginning," he said.
Nathan Brookwood, analyst with Insight64 (Saratoga, Calif.), said it may be tough to sort out the performance claims. "You won't be able to tell anything about these processors from the Spec numbers," said Brookwood, referring to the traditional measure of single-processor performance. Benchmarks based on real applications may be more useful, along with the TPC-C test for transaction-processing performance, analysts said.
But once the new chips take hold, frequency could finally fall from its perch as a measure of server processor prowess. "What may change is a new metric of throughput per watt or per millimeter squared instead of raw frequency," said Patterson, the Berkeley professor and Sun consultant.