SAN JOSE, Calif. - Intel Corp.'s next-generation Itanium processor, code-named McKinley, has undergone such a performance-enhancing makeover that it has a shot at running at least 50 percent faster than the current Itanium line, the company said.
With a revamped cache structure, which includes a ground-breaking Level 3 cache, faster front-side bus and more logic resources at its disposal, McKinley is expected to hit that performance level using the same code as Itanium, the company said here at the Intel Developer Forum (IDF).
At the same time, Intel tipped plans to eventually incorporate multithreading, which it calls "hyperthreading," into its full range of processor lines, starting with the Xeon processor. The move could help mute critics who have said the IA-64 architecture is overly dependent on instruction-level parallelism, which relies heavily on software compilers.
With the revamped McKinley architecture, the Itanium product line will see its clock speed increase from 800 MHz to 1 GHz, which is half the frequency of the company's fastest 2-GHz Pentium 4. The Pentium 4 uses a long, 20-stage pipeline that is fundamentally more amenable to higher clock speeds. McKinley, by contrast, uses an eight-stage core pipeline. Intel contends, however, that the faster front-side bus, more on-chip memory and redundant logic resources will more than make up for the processor's lag in clock speed.
McKinley's pipeline is shorter than the Pentium 4's, which limits clock speed but reduces the penalty during branch mispredictions. McKinley also has two auxiliary pipelines for the Level 2 cache and floating point that overlap the final stages of the core pipeline. "We know early whether we're going to have an L2 hit, so this speeds things up," said Gary Hammond, principal architect at Intel's enterprise platform group, during a technical presentation on McKinley here.
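The trade-off between pipeline depth and misprediction cost can be illustrated with a crude first-order model. The sketch below assumes a mispredicted branch forces the entire pipeline to refill at one stage per cycle; that is a simplifying assumption for illustration, not a figure from Intel, and the function name is hypothetical. The stage counts and clock speeds are the ones quoted above.

```python
# Crude model (an assumption, not Intel data): a mispredicted branch
# flushes the pipeline, which then refills at one stage per clock cycle.

def refill_time_ns(stages, clock_ghz):
    """Approximate time to refill a pipeline of `stages` stages, in ns."""
    # At 1 GHz one cycle lasts 1 ns, so cycles / GHz gives nanoseconds.
    return stages / clock_ghz

pentium4_ns = refill_time_ns(stages=20, clock_ghz=2.0)
mckinley_ns = refill_time_ns(stages=8, clock_ghz=1.0)

print(f"Pentium 4 refill: about {pentium4_ns:.0f} ns")
print(f"McKinley refill:  about {mckinley_ns:.0f} ns")
```

Under this model the shorter pipeline costs fewer cycles per flush, which offsets part of its clock-speed disadvantage. Real penalties depend on where in the pipeline the branch resolves.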
The McKinley team, comprising Intel and Hewlett-Packard Co., also tossed in more redundant resources. McKinley will sport 11 issue ports instead of nine for the existing Itanium, and six integer units vs. Itanium's current four. As for registers, McKinley has 328, more than three times as many as Sun Microsystems Inc.'s UltraSparc3 processor, Hammond said.
Cache value
The McKinley team also made liberal use of on-chip cache. Its decision to move Level 3 onto the same die as the processor is sure to be questioned by some of Intel's competitors because of the extra die area it requires. Motorola Inc., for one, recently said it's more inclined to add an on-chip DRAM controller than Level 3 cache to its PowerPC line as a way to cut memory latency.
Intel said it considered such a move as well, but decided that integrating Level 3 cache on-chip was the best way to address latency. With the cache on-die, an access to the outermost cache level takes 12 clock cycles, compared with 24 cycles when the Level 3 is a separate device, and the cache provides 32 gigabytes per second of bandwidth. And Intel isn't stopping there: Madison, a pin-compatible follow-on to McKinley, will have 6 Mbytes of L3 cache.
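As a back-of-the-envelope check, the cycle counts and bandwidth quoted above can be converted to nanoseconds and bytes per cycle. The sketch below assumes the L3 cache runs at McKinley's 1-GHz core clock and that "gigabytes" means 10^9 bytes; the article states neither, and the function names are illustrative.

```python
# Assumptions (not stated in the article): the L3 cache is clocked at the
# 1-GHz core frequency, and 1 gigabyte = 10**9 bytes.
CORE_CLOCK_HZ = 1_000_000_000


def cycles_to_ns(cycles, clock_hz=CORE_CLOCK_HZ):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


def bytes_per_cycle(bandwidth_bytes_per_s, clock_hz=CORE_CLOCK_HZ):
    """Convert a bandwidth in bytes per second to bytes per clock cycle."""
    return bandwidth_bytes_per_s / clock_hz


on_die_ns = cycles_to_ns(12)              # on-die L3 access
off_die_ns = cycles_to_ns(24)             # L3 as a separate device
l3_bytes_per_cycle = bytes_per_cycle(32e9)

print(f"On-die L3:  {on_die_ns:.0f} ns")
print(f"Off-die L3: {off_die_ns:.0f} ns")
print(f"L3 bandwidth: {l3_bytes_per_cycle:.0f} bytes per cycle")
```

Under these assumptions, moving the cache on-die halves the access time and implies a data path that moves 32 bytes per clock.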
McKinley doesn't make use of any specially designed SRAM cells for the cache, but the team did make some changes to the control circuitry for better density, something the company will disclose later, said Sam Naffziger, lead architect for Hewlett-Packard's microprocessor technology lab, which co-designed McKinley.
The HP-Intel team also took a fresh look at the lower layers of cache. The L1 is a single-cycle cache designed to minimize the load-use penalty by focusing on integer code while diverting floating-point code to the Level 2 cache. The L2, meanwhile, was designed as a non-blocking, out-of-order cache that allows higher-priority instructions to bypass the queue, Naffziger said.
To boost bandwidth to and from the processor, Intel and HP widened the front-side bus to 128 bits from 64 bits and increased its clock frequency so that it delivers 6.4 Gbytes per second, three times the bandwidth of Itanium's bus.
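The bus figures above imply a transfer rate that the article does not spell out. The sketch below derives it, assuming 1 Gbyte = 10^9 bytes and reading "three times" as a 3x bandwidth ratio; the function name is illustrative, and the Itanium numbers are inferred from that ratio rather than quoted by Intel.

```python
# Assumptions: 1 Gbyte = 10**9 bytes; Itanium's bus bandwidth is one third
# of McKinley's, as the article's "three times" comparison suggests.

def transfers_per_second(bandwidth_bytes_per_s, bus_width_bits):
    """Transfers per second needed to reach a bandwidth on a given bus width."""
    bytes_per_transfer = bus_width_bits // 8
    return bandwidth_bytes_per_s / bytes_per_transfer


mckinley_bw = 6.4e9                    # bytes per second, from the article
itanium_bw = mckinley_bw / 3           # inferred, not quoted

mckinley_rate = transfers_per_second(mckinley_bw, 128)
itanium_rate = transfers_per_second(itanium_bw, 64)

print(f"McKinley: {mckinley_rate / 1e6:.0f} million transfers per second")
print(f"Itanium:  {itanium_rate / 1e6:.0f} million transfers per second")
print(f"Gain from width: {128 / 64:.1f}x, from rate: {mckinley_rate / itanium_rate:.1f}x")
```

On these numbers, doubling the bus width accounts for a factor of two and the higher transfer rate for the remaining factor of about 1.5.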
After all those improvements, Intel stressed McKinley will run 1.5 to 2 times faster than the current Itanium using the same code. "Most of this performance is attainable without any new compilation," said Gadi Singer, vice president of the architecture group and general manager of the enterprise processor division at Intel, during his keynote address at IDF. SPECint2000 tests show that McKinley runs 70 percent faster than Itanium using the same code, Singer said.
In the future, Intel is looking to add multithreading to its Itanium line, giving Itanium both instruction- and task-level parallelism features. Sun Microsystems also said recently that its UltraSparc 5 will have both capabilities.
One of the big benefits of multithreading is that servers, which already use multithreading among different physical processors, will be able to run multithreaded applications on a single processor once the operating systems support the feature.
"When hyperthreading is realized it will create the illusion that there are two logical processors, even though there's only one CPU," Intel's Hammond said.
Intel won't disclose publicly when it will add multithreading to its Itanium line, but the company's Pentium 4 architecture already has the capability built in. The feature, however, has been disabled until the company comes out with its first Xeon processor with multithreading.
Even then there will be some limitations, because not all of the processor's resources are going to be duplicated. "You should know that if you need two floating-point units, you need a two-processor system," said Shannon Poulin, Intel's enterprise marketing manager.
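From software's point of view, the "two logical processors" simply show up as a higher processor count reported by the operating system. A minimal sketch in Python, using the standard library's `os.cpu_count()`; the wrapper function name is illustrative:

```python
import os


def logical_processor_count():
    """Return the number of logical processors the OS reports (1 if unknown)."""
    # os.cpu_count() counts logical processors, not physical packages,
    # and returns None when the count cannot be determined.
    return os.cpu_count() or 1


print(f"Logical processors visible to software: {logical_processor_count()}")
```

On a single-CPU system with hyperthreading enabled and a supporting operating system, a call like this would be expected to report two processors, even though execution resources such as the floating-point unit are shared.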
One question that Intel isn't addressing, at least publicly, is McKinley's power consumption. At 130 watts, the existing Itanium has become a concern for OEMs making dense servers, who have had to devise special cooling systems to get the heat out of the box. So far, Intel has proposed a reference design based on the 870 chip set that would fit two to four McKinleys in a 28-inch-deep, 4-U rack.
Intel's bias toward adding more on-chip resources could exacerbate power consumption. After the redundant logic and L3 cache are added, the device will likely have more transistors than the current Itanium. Intel is also planning to manufacture the chip using the same 0.18-micron process technology as the current Itanium, so the first run of McKinley will not benefit from a process technology shrink.
Safety net
To protect against heat-related system meltdowns, McKinley includes a programmable thermal trip that can throttle processor performance by 40 percent to cut power consumption. But the company sees that more as a safety net, not as an answer to thermal issues. "This should never be needed in a properly designed system," said Naffziger.
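The behavior of a programmable thermal trip can be sketched in a few lines. This is a hypothetical model, not McKinley's actual mechanism: the function name, the 90-degree trip point and the reading of "throttle by 40 percent" as running at 60 percent of full performance are all assumptions made for illustration.

```python
THROTTLE_FRACTION = 0.40  # performance reduction quoted in the article


def performance_level(temp_c, trip_point_c):
    """Return the fraction of full performance allowed at temperature temp_c.

    trip_point_c stands in for the programmable threshold; how it is
    actually set on McKinley is not described in the article.
    """
    if temp_c >= trip_point_c:
        return 1.0 - THROTTLE_FRACTION  # throttled
    return 1.0                          # normal operation


# Hypothetical trip point of 90 degrees C, chosen only for the example.
for temp in (70, 89, 90, 105):
    print(f"{temp} C -> {performance_level(temp, trip_point_c=90):.0%} performance")
```

In a properly cooled system the throttled branch would never be taken, which matches Naffziger's description of the feature as a safety net.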
The thermal trip is one of several reliability features built into McKinley. The device includes parity and error-correcting code (ECC) on the L2 and L3 caches to shield it from soft errors caused by either alpha particles or cosmic rays. "Our mean time between failures is more than a thousand years," Naffziger said.
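To show how parity and ECC handle a soft error, the sketch below uses a textbook Hamming(7,4) code as a stand-in. The article does not say which code McKinley's caches use, and real cache ECC typically protects much wider words, so this is an illustration of the principle only. Parity can detect a single flipped bit; the Hamming code can also locate and correct it.

```python
def parity_bit(bits):
    """Even-parity bit: 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, error position); position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome gives the 1-based bit position
    if position:
        c[position - 1] ^= 1         # flip the faulty bit back
    return c, position


def hamming74_decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


data = [1, 0, 1, 1]
stored = hamming74_encode(data)

# Simulate a soft error: a single bit flips while the word sits in the cache.
corrupted = list(stored)
corrupted[4] ^= 1

# Parity alone notices that something changed but cannot say which bit.
print("parity mismatch:", parity_bit(stored) != parity_bit(corrupted))

fixed, position = hamming74_correct(corrupted)
print("error at position:", position)
print("data recovered:", hamming74_decode(fixed) == data)
```

The same idea scales to wider words by adding check bits, with an extra overall parity bit commonly used to detect, though not correct, double-bit errors.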
Intel is also expanding its error coverage so that all the components and buses on the McKinley platform, which is based on the 870 chip set, will have ECC, said Bassam Elkhoury, principal engineer for Intel's server division.