AMD lays out K7 competitor to Katmai
SAN JOSE, Calif. Advanced Micro Devices Inc. placed its K7 microprocessor at stage center of the Microprocessor Forum. Expected to hit the market in mid-1999, K7 will target Intel Corp.'s lucrative 85 percent share of the personal computer microprocessor market.
The K7 is an X86 design that takes AMD further afield of Intel in terms of its instruction set, bus architecture and chip set support. The K7, which is expected to debut with a clock speed above 500 MHz, is to be AMD's counter to Intel's Katmai processor with the Katmai New Instructions, set to hit the market in the first quarter.
Michael Slater, editorial director of Microprocessor Report, said Intel's Willamette IA-32 microarchitecture promises to be even more aggressive than the K7. But Katmai is based on the Pentium II microarchitecture, which is now nearly five years old. "The K7 is substantially beyond what Intel will offer with Katmai, giving AMD an 18-month or more advantage in terms of microarchitecture," he said.
If AMD can deliver the K7's promised performance and clock speed, it would be able to compete with Intel head-on in the lucrative performance desktop sector, Slater said.
While AMD's K6 processor debuted with a similar promise to surpass Intel's best available performance, that hope was quickly scotched by manufacturing problems at AMD.
But "AMD is a different company now," Slater said. "Atiq [Raza, executive vice president of AMD, who heads product development] is running the company, and the K7 team is not the same team that did the K5. It is a very experienced team, and with this design they have hit on a lot of different things" that improve performance.
While the K7 will eventually be used across a range of less expensive systems, AMD's initial objective is to offer a high-end processor that will compete against Intel in the market for performance desktops in the $2,000 range, as well as for low-end servers.
Richard Heye, general manager of the K7 project, said the processor will work with 100-MHz SDRAMs when it is introduced next year. "Later in 1999 we will extend that to Rambus memories," he said. "But the timing is up to the OEMs, and we don't intend to dictate to them."
Initially, the K7's 22 million transistors will take up 184 square millimeters of silicon. That relatively large die will be manufactured in AMD's 0.25-micron process at Fab 25 in Austin, Texas. By 2000, plans call for the design to be implemented in a 0.18-micron process, which will reduce die size to less than 100 square mm. At that point, the device's six metal layers (plus local interconnect) will shift from aluminum to copper.
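The die-size projection can be sanity-checked with a rough calculation. The sketch below assumes an ideal linear shrink, in which area scales with the square of the feature size; that is a simplification, since real shrinks rarely scale every structure uniformly. The function name is illustrative only.

```python
# Ideal-shrink estimate: die area scales with the square of the feature size.
# This is a simplifying assumption, not AMD's own projection method.
def scaled_die_area(area_mm2: float, old_node_um: float, new_node_um: float) -> float:
    """Return the die area after an ideal linear shrink between two process nodes."""
    linear_scale = new_node_um / old_node_um
    return area_mm2 * linear_scale ** 2

k7_area_025 = 184.0  # mm^2 at 0.25 micron, from the article
k7_area_018 = scaled_die_area(k7_area_025, old_node_um=0.25, new_node_um=0.18)
print(f"Estimated die area at 0.18 micron: {k7_area_018:.1f} mm^2")
```

Under that assumption the estimate comes in at roughly 95 square mm, which is consistent with the "less than 100 square mm" figure.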
AMD is developing its own core logic for the K7. The logic will be manufactured in a 0.35-micron process by United Microelectronics Corp. (Hsinchu, Taiwan). In one difference from Intel, AMD's K7 will be supported by core logic from several major chip set vendors, all of which derive most of their revenues from K6-based systems.
Another difference from Intel is the K7's adoption of the 200-MHz system bus, which was licensed from Digital Equipment Corp. The EV-6 bus is the same one used for the Alpha 21264 processor designed by Digital and now owned by Compaq Computer Corp. This sets up the intriguing possibility that computer OEMs could develop motherboards and complete systems that could swap out daughtercards holding either a K7 or Alpha 21264, depending on a customer's requirements.
The K7 will be placed on a daughtercard that is mechanically compatible with the Slot 1 design from Intel, though it will be different electrically. Compaq could offer the Alpha on daughtercards to the commercial market, which would ensure swappability with the K7 for system OEMs.
Dirk Meyer, director of engineering for the K7 project, was the co-architect of Digital's Alpha 21264 before he joined AMD in early 1995, when he took charge of the K7's 150-person design team.
"One concern I had early on is that when people heard that we had licensed the EV-6 bus from Digital, people would think, 'Alphas are expensive so this must be an expensive bus.' But in reality, this requires 25 to 30 fewer pins, and it is simpler in terms of electrical complexity" compared with the Slot 1 bus, Meyer said.
The EV-6 bus is a point-to-point bus, which "makes the electrical environment much, much cleaner," Meyer said. "With Intel's approach, once you put more than one drop on a line you don't have a transmission line anymore, you have stubs. So you end up with ringing and reflections that make the electrical design much more difficult. I would expect this 200-MHz bus to be simpler than a 133-MHz Slot 1 bus, because electrically it is a much cleaner environment."
The bus sends clocks with data in either direction. This approach "removes clock skew from the equation that determines how fast you can run the wires," Meyer said. "By sending the clock with the data, it puts less demand on the system and allows higher bandwidth across the interconnect."
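Meyer's point about clock skew can be illustrated with a simplified timing budget. The sketch below assumes the minimum bus period is the sum of a few delay terms and that forwarding the clock with the data drops the skew term to zero. Every delay value is hypothetical, chosen only to show the effect; none is a published figure for the EV-6 or Slot 1 bus.

```python
# Simplified timing budget: the clock period must cover every delay term.
# All delay values below are hypothetical, for illustration only.
def max_bus_clock_mhz(t_clk_to_out_ns: float, t_flight_ns: float,
                      t_setup_ns: float, t_skew_ns: float) -> float:
    """Return the highest clock rate whose period covers the listed delays."""
    min_period_ns = t_clk_to_out_ns + t_flight_ns + t_setup_ns + t_skew_ns
    return 1000.0 / min_period_ns  # a 1 ns period corresponds to 1000 MHz

# Common-clock bus: skew between sender and receiver clocks eats into the budget.
common_clock = max_bus_clock_mhz(2.0, 1.5, 1.0, t_skew_ns=1.0)

# Forwarded clock: the clock travels with the data, so the skew term is
# modeled here as zero (an idealization).
forwarded_clock = max_bus_clock_mhz(2.0, 1.5, 1.0, t_skew_ns=0.0)

print(f"Common clock:    {common_clock:.0f} MHz")
print(f"Forwarded clock: {forwarded_clock:.0f} MHz")
```

With these made-up numbers, removing a 1-ns skew term raises the achievable clock from about 182 MHz to about 222 MHz. The absolute values mean nothing; the point is that every term removed from the budget shortens the minimum period.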
Bundled instructions
In his presentation, Meyer said the K7 executes the most common X86 instructions in bundles that AMD calls macroOPS. Some macroOPS contain a single X86 instruction, but many combine two normal instructions, for example a load instruction and an XOR instruction.
These macroOPS are 15 bytes long and are handled by direct path decoders, while other less commonly used instructions are routed to a vector decode path.
The instruction decode pipeline can serve up three macroOPS per cycle, and they are forwarded to the instruction control unit. Up to 72 separate X86 instructions can be "in flight" either in instruction dispatch or retirement at any one time.
"One key differentiator, compared with the Pentium II core, is that the Pentium II architecture takes X86 instructions and breaks them apart into RISC instructions, and the RISC operations are what get scheduled in the machine," Meyer said. "With the K7, the central quantum of information that floats around the machine is not decomposed RISC operations, it is a macro operation. It's basically a denser thing that holds more information, but by doing it that way, it makes the hardware simpler. In other words, we can build the hardware to directly execute the X86 instructions instead of executing the RISC instructions that emulate the X86 instructions."
To support the high-bandwidth bus, Meyer said the design includes enough internal scheduling instruction depth, and deep buffers, to actually uncover the parallelism in the code and make use of it. "One thing that differentiates the K7 from essentially any other X86 processor is the deep buffers we have in place," he said. "The scheduler can schedule 15 of these complex operations, which is 30 of the simple operations. There is a huge scheduler core on the floating-point unit for 44 entries, and lots of address and data buffering in the bus interface unit.
"So if there is one theme, it is deep, deep instruction and memory schedulers, and lots of memory and address data buffering," Meyer said. "The more bandwidth you have, the more buffering you need to sustain the bandwidth. We went to fairly great lengths to make sure the machine would have the buffering it needed to uncover all the work that is available in auto decode."
Floating-point hurdle
With three general execution units and three address calculation units, Meyer described the K7 as "a wider superscalar machine than the Pentium II."
On the floating-point side, particularly for X87 code, the difference between the K7 and the Pentium II "is even more apparent," he said. The K7 has two double-precision X87 data paths, which are fully pipelined, compared with one for the Pentium II which, Meyer said, is not fully pipelined for double precision. "If you want to do an X87 instruction on the Pentium II, in one sense you cycle the instruction through the pipe twice. So in some ways for double precision X87, the K7 has four times the peak execution rate."
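The "four times" figure follows from the pipeline counts Meyer gives. The sketch below treats peak rate as the number of data paths divided by the cycles each double-precision operation occupies a path, taking his description of the Pentium II cycling an instruction "through the pipe twice" as two cycles per operation. The function name is illustrative, and this is peak-rate arithmetic only, not a performance prediction.

```python
# Peak issue rate = number of data paths / cycles each operation occupies a path.
def peak_ops_per_cycle(pipes: int, cycles_per_op: int) -> float:
    """Return the peak double-precision X87 operations started per clock cycle."""
    return pipes / cycles_per_op

# K7: two fully pipelined double-precision paths (one new op per path per cycle).
k7_rate = peak_ops_per_cycle(pipes=2, cycles_per_op=1)

# Pentium II: one path, with each double-precision op passing through twice,
# per Meyer's description.
pentium2_rate = peak_ops_per_cycle(pipes=1, cycles_per_op=2)

print(f"K7: {k7_rate} ops/cycle, Pentium II: {pentium2_rate} ops/cycle")
print(f"Peak ratio: {k7_rate / pentium2_rate}x")
```

The ratio applies per clock cycle, so any difference in clock speed between the two parts would scale it further.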
Because the 3DNow multimedia instructions are single instruction multiple data (SIMD), in every register there are two single precision numbers, and two pipelines can operate on that data. "So, essentially, the K7 provides twice the performance [available on the Pentium II] for X87 code," Meyer said.





