MONTEREY, Calif. - Intel Corp. and others are investigating Level 3 cache as a way to juice microprocessor performance. Having exhausted its best X86 design tricks, Intel is looking at ways to boost clock frequency while keeping the number of instructions executed per clock from falling off a cliff. Large Level 3 caches are prime among them.
Also on the table are new pipeline techniques, more floating-point horsepower and even finer-grained speculation methods in forthcoming generations of the IA-32 processors, which analysts say have another five years before they are subsumed into the 64-bit processing realm.
Level 3 caches are most likely to show up first, and may become part of an off-the-shelf offering within two years. That's because the data sets in some applications are getting so large they can't be fully contained by a 256-kbyte L2 cache, putting the CPU at risk of performance-killing cache misses. CPUs with extremely deep pipelines, such as the 20-stage-pipeline Pentium 4, are especially susceptible.
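The cost of those misses can be roughed out with the standard average-memory-access-time calculation. The sketch below is only an illustration: the function name `amat` and every latency and miss rate in it are hypothetical values chosen to show the shape of the trade-off, not figures from Intel or any other vendor.

```python
def amat(levels, memory_latency):
    """Average memory access time, in CPU cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), ordered from L1 outward.
    memory_latency: cycles to reach main memory once the last cache level misses.
    """
    total = 0.0
    reach = 1.0  # fraction of all accesses that get as far as this level
    for hit_latency, miss_rate in levels:
        total += reach * hit_latency
        reach *= miss_rate
    return total + reach * memory_latency


# Hypothetical numbers, for illustration only.
L1 = (2, 0.05)
L2 = (10, 0.40)   # assumed: a large data set spills out of a 256-kbyte L2
L3 = (40, 0.30)   # assumed: an off-chip or chip-set L3
MEMORY = 200      # assumed main-memory latency in cycles

two_level = amat([L1, L2], MEMORY)
three_level = amat([L1, L2, L3], MEMORY)

print(f"two levels:   {two_level:.1f} cycles per access")
print(f"three levels: {three_level:.1f} cycles per access")
```

Under these assumed numbers the third level trims the average access time, but the gain depends entirely on how often the L2 misses and how far away main memory is.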
Observers noted that some workstation and server makers have proprietary L3 cache designs already, but microprocessor and chip set makers are not yet providing off-the-shelf products with L3.
At the Micro-33 conference here last week, Darrell Boggs, Intel's principal engineer for the desktop platform group (Hillsboro, Ore.), said Intel originally planned to add an off-chip L3 cache to the Pentium 4, but killed the plan when it proved too costly. He noted, however, that L3 could still be in Intel's game plan.
Meanwhile, Micron Technology Inc. has developed a chip set with 8 Mbytes of on-chip DRAM in the north bridge chip that acts as an L3 cache, said Dean Klein, Micron's vice president of integrated products.
Intended to support Advanced Micro Devices Inc.'s Thunderbird processor, the Micron chip set will sustain 10 Gbytes/second of internal bandwidth. Half of the memory banks will be used to hide refresh and precharge commands to reduce latency, and the other half will be used to feed the processor, Klein said.
Interest weighed
Armed with a license from AMD, Micron will first gauge customer demand before deciding whether to put the chip set out on the market. "We have a fair amount of interest in the [north bridge] part," Klein said. "We're definitely going to take this to first silicon."
Boggs hinted that Intel too might be leaning toward integrating L3 into its chip sets. Keeping the cache close to the processor would minimize capacitance and inductance and improve signal integrity. The trade-off is added cost for more pads around the processor die. Building a new cartridge to contain the processor and cache is prohibitively expensive, Boggs said.
Chip set vendors, however, are in a good position to bring in L3 cache. "One of the ways for chip set vendors to differentiate themselves would be to add value with memory or disk caches," Boggs said.
Observers said there is still some debate over whether L3 caches will trickle down from high-end systems to desktops.
Citing the cost pressures of the PC market, Dean McCarron, principal analyst for Mercury Research (Scottsdale, Ariz.), said, "To add an L3 cache would be counter to that direction. It does buy performance, but each layer of cache buys you less and less performance."
But Peter Glaskowsky, a senior analyst with MicroDesign Resources, said processor-to-memory latency is getting so important that even low-end systems will need an L3 cache. "Within two years we will start seeing desktop systems ship with three levels of cache," he said.
There are several L3 options at the chip industry's disposal. SRAM is generally the fastest, most commonly used type of cache memory. Embedded DRAM is about three times denser than SRAM but is slower because its cells must be precharged before being accessed. Another option that has gained some acceptance is the one-transistor "SRAM" developed by MoSys, which is a DRAM cell that acts as a fast SRAM cell, Glaskowsky said.
Adding L3 cache is probably the easiest way to ratchet up performance with the least pain, but Intel is also looking at architectural improvements that will take a bigger design effort.
In a paper delivered at Micro-33, Intel researchers described what they called a "circuit-level speculation" technique that some consider the next stage in superscalar design. "It's one of the hot topics of CPU design right now," Glaskowsky said.
Rampant speculation
The goal of circuit-level speculation (also known as value speculation) is to increase the number of instructions per cycle, a measure of performance that can suffer with deeper pipelines. In the paper, Intel called for the use of an "approximation" circuit to predict the output of logic blocks that can hold back clock frequency, including register rename, issue and adder logic. Like other speculative techniques, the circuit needs to correctly predict the result most of the time. "It's the same for branch prediction: sometimes it's wrong, but usually it's not," Glaskowsky said.
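For the adder case, the idea can be modeled in software. The sketch below is not the circuit from the Intel paper; it assumes one simple form of approximation, in which each sum bit sees carries from only a limited window of lower bits, and then compares the guess against the exact sum. The function names, the 32-bit width and the 8-bit window are all hypothetical choices for illustration.

```python
import random


def exact_add(a, b, width=32):
    """Reference result: a full-width add, wrapped to `width` bits."""
    return (a + b) & ((1 << width) - 1)


def speculative_add(a, b, width=32, window=8):
    """Approximate add: each sum bit only sees carries generated within
    the `window` bits directly below it (assumed model, not Intel's design)."""
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        span = i - lo                     # number of lower bits considered
        mask = (1 << span) - 1
        a_lo = (a >> lo) & mask
        b_lo = (b >> lo) & mask
        carry_in = (a_lo + b_lo) >> span  # assumes no carry enters the window
        bit = ((a >> i) ^ (b >> i) ^ carry_in) & 1
        result |= bit << i
    return result


def hit_rate(trials=2000, width=32, window=8, seed=1):
    """Fraction of random operand pairs for which the guess equals the exact sum."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        if speculative_add(a, b, width, window) == exact_add(a, b, width):
            hits += 1
    return hits / trials


# A long carry chain defeats the short window, so this guess is wrong
# and would have to be detected and corrected.
wrong_guess = speculative_add(0xFFFF, 1) != exact_add(0xFFFF, 1)
print("0xFFFF + 1 mispredicted:", wrong_guess)
print("hit rate on random operands:", hit_rate())
```

The short window is what would let such a circuit settle quickly; the price is that the occasional long carry chain produces a wrong answer that has to be caught and redone, much as a mispredicted branch does.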
It's clear that Intel wants to do all it can to keep the clock-frequency treadmill from stalling. At the conference, several Intel researchers concurred that lengthening the pipeline was a possibility. A second Intel paper described a way to pipeline scheduling logic over two cycles with back-to-back execution of dependent instructions, allowing deeper pipelines to be built. "It wouldn't surprise me if we saw longer pipelines in the future," Boggs said.
Pipelines for some 3-D graphics engines can stretch as long as 100 stages, so it's not a matter of whether Intel can go deeper. But extremely long pipelines usually execute algorithms that rarely have branches going off in different directions, so there's less chance of encountering a branch misprediction and having to flush out the pipeline and start over.
That's why an AMD Athlon processor running at 1.2 GHz will outperform a Pentium 4 running at 1.4 GHz on office productivity applications, while Pentium 4 will excel in multimedia applications, Glaskowsky said. "The price you pay is on lower-efficiency, older applications like Word, and we see that very clearly on Pentium 4," he said.
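A back-of-the-envelope model shows how a deeper pipeline can win on one kind of workload and lose on another. The sketch below is an assumption-laden illustration: the function names, the base IPC, the branch statistics and the flush penalties are hypothetical, and they are not measurements of the Athlon or the Pentium 4.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, flush_penalty):
    """Instructions per cycle after charging each mispredicted branch
    a pipeline flush of `flush_penalty` cycles."""
    base_cpi = 1.0 / base_ipc
    stall_cpi = branch_fraction * mispredict_rate * flush_penalty
    return 1.0 / (base_cpi + stall_cpi)


def throughput(clock_ghz, ipc):
    """Billions of instructions per second."""
    return clock_ghz * ipc


# Hypothetical designs: same base IPC, different clock and flush penalty.
SHALLOW = {"clock_ghz": 1.2, "flush_penalty": 10}
DEEP = {"clock_ghz": 1.4, "flush_penalty": 20}

# Hypothetical workloads: (fraction of instructions that are branches,
# fraction of those branches that are mispredicted).
WORKLOADS = {
    "branchy office code": (0.20, 0.08),
    "streaming multimedia": (0.05, 0.01),
}

results = {}
for name, (branch_fraction, mispredict_rate) in WORKLOADS.items():
    for label, design in (("shallow", SHALLOW), ("deep", DEEP)):
        ipc = effective_ipc(1.5, branch_fraction, mispredict_rate,
                            design["flush_penalty"])
        results[(name, label)] = throughput(design["clock_ghz"], ipc)
        print(f"{name:22s} {label:8s} {results[(name, label)]:.2f}")
```

With these assumed inputs, the slower-clocked, shallower design comes out ahead on the branch-heavy workload, while the deeper design's clock advantage carries the streaming one.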
Some of this could have been mitigated by increasing the size of the execution trace cache, a new feature in the Pentium 4 that stores decoded micro-ops along a predicted path of execution to reduce instruction decode latency, Glaskowsky noted. In fact, Intel had such a plan, but decided to keep the size at 12,000 micro-ops as one way to hold down die area, Boggs said.
With later die shrinks, more transistors may be available to increase the size of the on-chip caches, observers said. Another possibility is to beef up floating-point units (FPUs). For the Pentium 4, Intel had to scale back plans to add two fully functional FPUs, opting instead to enable one to handle multimedia instructions.
"It would be really nice . . . having two fully functional FPUs," Boggs said, but "it remains to be seen."