MONTEREY, Calif. -- If Intel Corp.'s microprocessor architects had had their way when they were designing the Pentium 4, it would have been a very different beast than it is today, said the company's principal processor engineer.
A third-level cache strapped to the die, two full-fledged floating-point units, and bigger execution trace and level-one caches were all part of the original blueprint for Intel's most recently introduced processor. As it turned out, those features had to be modified, stripped down or dumped altogether to keep costs in line.
Intel engineers were forced to rethink their lofty intentions when it became clear that chip size had gotten too unwieldy as they tried packing in more hardware units to maximize performance. Power consumption, architecture complexity and testing also posed serious problems, said Darrell Boggs, Intel's principal engineer for the desktop platform group (Hillsboro, Ore.), addressing a room full of researchers and engineers at the Micro-33 conference here.
"The general trend has been to make [the CPU] larger in physical area," he said. "But anytime you have a large die size, that means you have to have many fabs. You can become capacity-constrained unless you build a new fab."
Under the original plan, the Pentium 4 was to have one slow ALU, two fast ALUs, two address-generation units, two fully functional floating-point units, 16 kbytes of L1 cache, an execution trace cache holding 12,000 micro-ops, 128 kbytes of L2 cache, 1 Mbyte of external L3 cache, an allocator/register renamer and a bus architecture.
But with fabs costing more than $2 billion, even the world's largest, most profitable semiconductor company had to reconsider its plans when it became apparent that the die was growing too big. The company decided that the first 0.18-micron Pentium 4 die would be no larger than that of the previous-generation Pentium Pro when it was introduced in 1995, Boggs said.
"If the first ones are large," he said, referring to previous-generation microarchitectures, "the next ones are going to have the propensity to be large. It was a very big issue for us."
Intel was able to meet or exceed its die size and power requirement goals, but not without pruning or tearing out some of the hardware inside and outside of the processor core and sacrificing some performance.
The pair of fully functional floating-point units sitting side by side, the most plainly visible feature of the die, was a sitting duck. "There was tremendous die area and power associated with the floating-point units," Boggs said.
Taking out the scalpel, the design team cut a pipeline off one of the FPUs. They also dumbed it down to just move data rather than execute MMX, SSE and SSE2 multimedia instructions.
The decision cost 5 percent in performance, Boggs said, but cut the size of the floating-point hardware by more than half.
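Boggs didn't spell out how the cut-down unit fits into the machine, but the division of labor can be sketched in a few lines of C: one unit keeps all the arithmetic, while the second handles only register moves. The enum values and routing below are illustrative assumptions, not Intel's actual dispatch logic.

```c
/*
 * Toy model of the trade-off Boggs described: one unit executes all
 * FP/MMX/SSE/SSE2 arithmetic, while the second is reduced to a
 * move-only unit. Values and routing are invented for illustration.
 */
#include <stdio.h>

typedef enum { FP_ADD, FP_MUL, FP_MOVE } fp_op;

/* Steer a micro-op to one of the two units. */
static const char *dispatch(fp_op op) {
    switch (op) {
    case FP_MOVE:
        return "unit 1 (move-only)";  /* the stripped-down FPU */
    default:
        return "unit 0 (full FPU)";   /* all arithmetic funnels here */
    }
}

int main(void) {
    fp_op trace[] = { FP_ADD, FP_MOVE, FP_MUL, FP_MOVE };
    for (int i = 0; i < 4; i++)
        printf("op %d -> %s\n", i, dispatch(trace[i]));
    return 0;
}
```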
Intel also took a look at the execution trace cache, a feature designed to compensate for the long instruction pipeline by caching only decoded micro-ops. It's a key part of the memory subsystem that reduces decode latency, and engineers were leaning toward making it bigger rather than smaller, Boggs said.
Compromising, Intel kept the trace cache at 12,000 micro-ops and developed a micro-op "compression algorithm" so that micro-ops can be stored in the cache using fewer bits. That gave the execution trace cache "essentially the same performance and less die size," Boggs said.
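Boggs offered no details of the compression scheme itself. As a rough illustration of the general idea, the toy C sketch below packs a micro-op's fields into fewer bits for storage and expands them on the way out; every field name and width here is invented, not Intel's encoding.

```c
/*
 * Toy sketch of micro-op compression: pack the fields of a micro-op
 * into 22 bits of a 32-bit word (7 + 5 + 5 + 5) instead of storing
 * four full bytes. Field widths are hypothetical.
 */
#include <stdint.h>
#include <assert.h>

typedef struct {             /* "uncompressed" micro-op */
    uint8_t opcode;          /* 7 useful bits assumed */
    uint8_t dst, src1, src2; /* 5-bit register ids assumed */
} uop;

static uint32_t uop_pack(uop u) {
    return ((uint32_t)u.opcode << 15) |
           ((uint32_t)u.dst    << 10) |
           ((uint32_t)u.src1   <<  5) |
            (uint32_t)u.src2;
}

static uop uop_unpack(uint32_t w) {
    uop u = { (w >> 15) & 0x7f, (w >> 10) & 0x1f,
              (w >> 5) & 0x1f,   w & 0x1f };
    return u;
}

int main(void) {
    uop u = { 0x2a, 3, 7, 12 };
    uop v = uop_unpack(uop_pack(u));   /* round-trip must be lossless */
    assert(u.opcode == v.opcode && u.dst == v.dst &&
           u.src1 == v.src1 && u.src2 == v.src2);
    return 0;
}
```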
Completely scrapped was the 1-Mbyte L3 cache, whose inclusion would have been something of a ground-breaking move to reduce processor-to-memory latency in a processor developed for the mass market. Intel's idea was to strap a separate memory chip, perhaps an SDRAM, onto the back of the processor to act as the L3.
But that would have added another 100 pads to the processor, pushing the pad-to-logic ratio "very high from an area perspective." It would also have forced Intel to devise an expensive cartridge package to contain the processor and cache memory, Boggs said.
To compensate for the loss of the L3, Intel doubled the L2 cache to 256 kbytes. It also cut the L1 cache in half, to 8 kbytes, and limited it to one load per clock, thereby reducing its latency, Boggs said.
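The logic of the L1 change can be seen in the textbook average-memory-access-time formula: a smaller, faster L1 can win overall even with a somewhat higher miss rate, provided the larger L2 sits close behind it. The cycle counts and miss rates below are hypothetical, not figures from Boggs' talk.

```c
/*
 * Back-of-the-envelope look at the L1 trade-off using the standard
 * formula: AMAT = L1_hit_time + L1_miss_rate * L2_penalty.
 * All numbers are invented for illustration.
 */
#include <stdio.h>

static double amat(double l1_hit, double l1_miss_rate, double l2_penalty) {
    return l1_hit + l1_miss_rate * l2_penalty;
}

int main(void) {
    /* Hypothetical: a 16-kbyte L1 at 3 cycles vs. an 8-kbyte L1 at
     * 2 cycles with a slightly higher miss rate, same L2 behind both. */
    printf("16K L1: %.2f cycles\n", amat(3.0, 0.05, 18.0));  /* 3.90 */
    printf(" 8K L1: %.2f cycles\n", amat(2.0, 0.07, 18.0));  /* 3.26 */
    return 0;
}
```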
Things are bound to get tougher for microarchitectures, which at Intel usually come out every five years. The most obvious ways to boost performance (longer pipelines, deeper buffering, more speculation) have already been done. That will mean more-complex, hard-to-test designs.
"The low hanging fruit is all gone," Boggs said. "Now we have to build scaffolds around the tree. We'll stand on our head and do strange things for a little more performance."