Advanced Micro Devices, apparently timing the announcement to coincide with a quiet period for rival Intel, on Sept. 10 unveiled a single-chip quad-core processor. The "native" device, code-named Barcelona, is AMD's first quad-core microprocessor. It was designed to carry the K8 architecture into a product that would compete with, and outperform, Intel's new Core architecture, used in the Core 2 Duo processor line.
The next-generation Opteron processor integrates four enhanced-performance X86 cores, each with 512-kbyte L2 cache and an enhanced 128-bit floating-point unit. The cores are integrated with a shared 2-Mbyte L3 cache and an improved on-chip memory controller that supports up to four 16-bit HyperTransport links and a dual-channel 128-bit DDR2/DDR3 interface.
The design contains more than 460 million transistors, about 120 million fewer than Intel's quad-core, which is code-named Clovertown and itself comprises two dual-core chips in a single package. The AMD chip is fabricated in a 65-nanometer silicon-on-insulator (SOI) CMOS process with dual stress liners and embedded SiGe for pMOS source/drains. The design uses 11 layers of copper interconnect and advanced low-k dielectrics that tie the four cores together. The dual stress nitride liners with embedded SiGe source/drain regions increase n- and p-channel mobility, resulting in higher current drive. As before, AMD's implementation of its 65-nm technology on an SOI substrate can increase latch-up resistance and reduce short-channel effects over an analogous bulk-silicon implementation.
The transistor performance of Intel's Woodcrest and AMD's Barcelona appears to match fairly closely, with the Barcelona's gate leakage about half that of the Woodcrest. This is not so surprising, as Intel uses a 25 percent thinner gate dielectric. AMD's device shows consistently lower gate dielectric leakage than Intel's, especially on the pFETs. The current drive for both devices is comparable, with the Barcelona coming out marginally higher for the pFET but lower for the nFET devices measured. However, the leakage current (Ioff) for the nFETs was two to five times lower in the Woodcrest, suggesting the need for more optimization of AMD's transistor. Since AMD and Intel have always considered the total package, system-level performance for a particular application generally has the final word. That information is not yet available for the Barcelona.
Some of the changes AMD has made are intended to:
1. Increase operation execution bandwidth, thereby increasing the loads per cycle from the cache. This should improve video-encoding performance.
2. Improve performance by adding an indirect branch predictor, which reduces mispredicted branches and increases processor efficiency. This improvement follows Intel's implementation in the Prescott.
3. Offload certain frequent operations to dedicated hardware, using a sideband stack optimizer. This approach, similar in function to Intel's dedicated stack manager, removes some of the load from the processor's decoders and reduces pipeline clogging.
4. Add the capability to reorder load instructions and enable memory access optimization; this serves to increase instruction load speed--again, similar to the capability implemented by Intel in its Core 2 processor architecture.
5. Reduce the frequency of switching between read and write memory-control operations by using a "write bursting" operation. With standard DDR2 memory, reads and writes cannot be performed simultaneously, and switching from one to the other introduces delays. In Intel's case, the fully buffered dual in-line memory module architecture lets the operations be performed simultaneously.
6. Improve chip performance by adding a DRAM prefetcher within the memory controller, where none had existed before (though prefetchers have been used extensively in different areas and components of the microprocessor). This prefetcher monitors memory requests to detect trends, then identifies and pulls data that appears likely to be used in the near future. The prefetched data is stored in a separate buffer.
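The DRAM prefetcher in item 6 can be illustrated with a toy model. The sketch below is not AMD's design, whose prediction logic is not described here; it assumes the simplest possible "trend," a constant stride between consecutive requests. The class name, the 64-byte line size and the four-entry buffer are all hypothetical choices made for illustration.

```python
from collections import deque


class StridePrefetcher:
    """Toy prefetcher: spots a constant stride and fetches one line ahead.

    Assumption: a "trend" is two consecutive requests with the same
    nonzero stride. Real hardware is considerably more elaborate.
    """

    def __init__(self, buffer_lines=4, line_bytes=64):
        self.line_bytes = line_bytes              # hypothetical line size
        self.buffer = deque(maxlen=buffer_lines)  # separate prefetch buffer
        self.last_line = None
        self.last_stride = None

    def access(self, address):
        """Record one request; return True if the prefetch buffer held it."""
        line = address // self.line_bytes
        hit = line in self.buffer
        if self.last_line is not None:
            stride = line - self.last_line
            # Same nonzero stride twice in a row: predict the next line.
            if stride != 0 and stride == self.last_stride:
                predicted = line + stride
                if predicted not in self.buffer:
                    self.buffer.append(predicted)
            self.last_stride = stride
        self.last_line = line
        return hit


# A sequential walk through memory: the first three requests establish
# the trend, and later ones are served from the prefetch buffer.
pf = StridePrefetcher()
print([pf.access(addr) for addr in (0, 64, 128, 192, 256)])
# prints [False, False, False, True, True]
```

In this model a scattered access pattern never triggers a prefetch, which is the behavior one would want: the buffer fills only when the request stream is predictable.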
Each core contains its own PLL, clock distribution system and power grid, with independent power/performance management capability (the core voltage and individual core frequencies operate independently of the northbridge). This enables the cores to enter power-efficient states while the processor interface operates at full speed to service DDR2/3 memory and HyperTransport traffic.
AMD has incorporated temperature controls for each of the cores by distributing eight remote temperature sensors across the core and six more remote sensors in the northbridge block. The controller tracks temperatures against predetermined limits and selects power-saving mode options.
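The control loop described above, in which sensor readings are compared against predetermined limits and a power-saving mode is chosen, might be sketched as follows. The temperature limits, the mode names and the rule of acting on the hottest sensor are all assumptions made for illustration; the actual limits and modes are not given here.

```python
# Hypothetical limits in degrees Celsius, highest first. The real
# thresholds and power-saving modes are not published in this article.
LIMITS = [
    (95.0, "halt"),
    (85.0, "low_power"),
    (75.0, "reduced_clock"),
]


def select_power_mode(sensor_temps_c):
    """Choose a mode for one block from its hottest sensor reading."""
    if not sensor_temps_c:
        raise ValueError("need at least one sensor reading")
    hottest = max(sensor_temps_c)
    for limit, mode in LIMITS:  # checked from the highest limit down
        if hottest >= limit:
            return mode
    return "full_speed"


def select_modes(block_sensors):
    """Map each block (core or northbridge) to its own mode."""
    return {name: select_power_mode(temps)
            for name, temps in block_sensors.items()}


# Example with made-up readings: one warm core throttles on its own
# while the other blocks keep running at full speed.
readings = {
    "core0": [61.0, 63.5],
    "core1": [74.0, 78.2],
    "northbridge": [58.0, 60.1, 59.4],
}
print(select_modes(readings))
```

Handling each block separately mirrors the independent per-core power management described earlier: a hot core can be throttled without slowing its neighbors or the northbridge.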
The cache is implemented with a standard 6T memory cell. AMD has provided for custom tuning of the write pulse time after device fabrication by enabling programming with electrical fuses. This provides scalability across a range of cache sizes.
Intel and AMD appear to have optimized their devices differently, so that Intel has lower Ioff leakage current, and AMD has lower gate dielectric leakage. How that relates to overall system performance will be seen in time.
Once shipments start, the advanced technology expected to be employed in Intel's Penryn architecture will be difficult or impossible for AMD to match until its own 45-nm technology is introduced in turn. Intel is racing not only AMD for the microprocessor performance crown, but also Matsushita Electric Industrial Co. Ltd. for technology leadership. In this less-visible race, Matsushita may beat Intel to 45-nm commercialization, albeit without a high-k gate offering. AMD has chosen not to participate in this contest, but rather to pursue the same objective in its own fashion, on its own timetable.
John Boyd (johnb@semiconductor.com) is technology analyst at Semiconductor Insights, a CMP Technology company in Kanata, Ontario. He holds more than 60 U.S. patents and has more than 40 applications pending.