Memory Hierarchy Design - Part 6. The Intel Core i7, fallacies, and pitfalls
John L. Hennessy, Stanford University, and David A. Patterson, University of California, Berkeley
10/30/2012 12:49 PM EDT
Editor's Note: Demand for increasing functionality and performance in system designs continues to drive the need for more memory, even as hardware engineers balance system capability, power, and cost against the growing performance gap between processor and memory. Architectures based on a memory hierarchy address these issues, and what better source for the details of this approach than an excerpt on the subject from the seminal book on computer architecture by John Hennessy and David Patterson?
This excerpt comprises:
- Part 1, Basics of Memory Hierarchies, which looked at the key issues surrounding memory hierarchies and set the stage for subsequent installments addressing cache design, memory optimization, and design approaches.
- Memory Hierarchy Design - Part 2. Ten advanced optimizations of cache performance, which reviewed ten advanced optimizations of cache performance.
- Memory Hierarchy Design - Part 3. Memory technology and optimizations, which examined innovations in main memory that offer improved system performance.
- Memory Hierarchy Design - Part 4. Virtual memory and virtual machines, which examined architecture support for protecting processes from each other via virtual memory and the role of virtual machines.
- Memory Hierarchy Design - Part 5. Crosscutting issues and the memory design of the ARM Cortex-A8, which looked at crosscutting issues for memory hierarchy design and reviewed the memory design of the ARM Cortex-A8.
- This installment, which examines the memory hierarchy design of the Intel Core i7; reviews fallacies and pitfalls in memory hierarchy design; and concludes with a look ahead.
The Intel Core i7
The i7 supports the x86-64 instruction set architecture, a 64-bit extension of the 80x86 architecture. The i7 is an out-of-order execution processor that includes four cores. In this chapter, we focus on the memory system design and performance from the viewpoint of a single core. The system performance of multiprocessor designs, including the i7 multicore, is examined in detail in Chapter 5.
Each core in an i7 can execute up to four 80x86 instructions per clock cycle, using a multiple issue, dynamically scheduled, 16-stage pipeline, which we describe in detail in Chapter 3. The i7 can also support up to two simultaneous threads per processor, using a technique called simultaneous multithreading, described in Chapter 4. In 2010, the fastest i7 had a clock rate of 3.3 GHz, which yields a peak instruction execution rate of 13.2 billion instructions per second, or over 50 billion instructions per second for the four-core design.
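The peak rates quoted above follow from multiplying clock rate, issue width, and core count. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and the result is an upper bound that real workloads will not reach.

```python
def peak_instruction_rate(clock_hz, issue_width, cores=1):
    """Upper bound on instructions per second: clock rate x issue width x cores."""
    return clock_hz * issue_width * cores


# Figures taken from the text: 3.3 GHz clock, up to four instructions per cycle.
single_core = peak_instruction_rate(3.3e9, 4)
four_cores = peak_instruction_rate(3.3e9, 4, cores=4)

print(f"{single_core / 1e9:.1f} billion instructions/sec per core")
print(f"{four_cores / 1e9:.1f} billion instructions/sec for four cores")
```

The four-core figure works out to 52.8 billion, which is the "over 50 billion" cited in the text.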
The i7 can support up to three memory channels, each consisting of a separate set of DIMMs, and each of which can transfer in parallel. Using DDR3-1066 (DIMM PC8500), the i7 has a peak memory bandwidth of just over 25 GB/sec. The i7 uses 48-bit virtual addresses and 36-bit physical addresses, yielding a maximum physical memory of 64 GB (2^36 bytes). Memory management is handled with a two-level TLB (see Appendix B, Section B.4), summarized in Figure 2.19.
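As a rough check on these figures, the Python sketch below recomputes the peak bandwidth and the address-space sizes. It assumes that DDR3-1066 performs about 1066 million transfers per second and that each channel is 64 bits (8 bytes) wide; neither value is stated in the text, and the function name is hypothetical.

```python
def peak_bandwidth_bytes_per_sec(transfers_per_sec, bytes_per_transfer, channels):
    """Peak bandwidth when every channel transfers on every opportunity."""
    return transfers_per_sec * bytes_per_transfer * channels


# Assumptions: ~1066 million transfers/sec per channel, 8 bytes per transfer.
bandwidth = peak_bandwidth_bytes_per_sec(1066e6, 8, 3)
print(f"peak bandwidth: {bandwidth / 1e9:.1f} GB/sec")

# Address-space sizes implied by the address widths given in the text.
physical_bytes = 2 ** 36  # 36-bit physical addresses
virtual_bytes = 2 ** 48   # 48-bit virtual addresses
print(f"physical address space: {physical_bytes // 2**30} GiB")
print(f"virtual address space: {virtual_bytes // 2**40} TiB")
```

Under these assumptions the three channels together deliver roughly 25.6 GB/sec, and 36 physical address bits span 64 GiB.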

Figure 2.19. Characteristics of the i7’s TLB structure, which has separate first-level instruction and data TLBs, both backed by a joint second-level TLB. The first-level TLBs support the standard 4 KB page size, as well as having a limited number of entries of large 2 to 4 MB pages; only 4 KB pages are supported in the second-level TLB.
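To make the two-level arrangement concrete, here is a minimal Python sketch of a translation that tries a first-level TLB, then the shared second-level TLB, and finally falls back to a page-table lookup. Only the 4 KB page size comes from the caption. The dictionaries, the function names, and the refill policy (a miss fills the levels above it) are illustrative assumptions, not a description of the i7's actual hardware, and capacity limits and large pages are ignored.

```python
PAGE_BITS = 12  # 4 KB pages give a 12-bit page offset


def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a 4 KB page."""
    return vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)


def translate(vaddr, l1_tlb, l2_tlb, page_table):
    """Translate vaddr and report which level supplied the mapping."""
    vpn, offset = split_virtual_address(vaddr)
    if vpn in l1_tlb:
        frame, source = l1_tlb[vpn], "L1 TLB"
    elif vpn in l2_tlb:
        frame, source = l2_tlb[vpn], "L2 TLB"
        l1_tlb[vpn] = frame  # assumed policy: refill the first level
    else:
        frame, source = page_table[vpn], "page walk"
        l2_tlb[vpn] = frame  # assumed policy: fill both levels on a walk
        l1_tlb[vpn] = frame
    return (frame << PAGE_BITS) | offset, source


# Hypothetical mapping: virtual page 0x12345 -> physical frame 0xABC.
page_table = {0x12345: 0xABC}
l1_tlb, l2_tlb = {}, {}
first = translate(0x12345678, l1_tlb, l2_tlb, page_table)
second = translate(0x12345678, l1_tlb, l2_tlb, page_table)
print(hex(first[0]), first[1])
print(hex(second[0]), second[1])
```

The first access walks the page table; the repeat access is satisfied by the first-level TLB.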
Figure 2.20 summarizes the i7’s three-level cache hierarchy.

Figure 2.20. Characteristics of the three-level cache hierarchy in the i7. All three caches use write-back and a block size of 64 bytes. The L1 and L2 caches are separate for each core, while the L3 cache is shared among the cores on a chip and is a total of 2 MB per core. All three caches are nonblocking and allow multiple outstanding writes. A merging write buffer is used for the L1 cache, which holds data in the event that the line is not present in L1 when it is written. (That is, an L1 write miss does not cause the line to be allocated.) L3 is inclusive of L1 and L2; we explore this property in further detail when we explain multiprocessor caches. Replacement is by a variant on pseudo-LRU; in the case of L3 the block replaced is always the lowest numbered way whose access bit is turned off. This is not quite random but is easy to compute.
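The L3 replacement rule in the caption can be sketched in a few lines of Python. The class below models a single cache set with one access bit per way and picks the lowest-numbered way whose bit is off. The caption does not say what happens once every bit is set, so the reset step here (clear all bits except the one just touched) is an assumption borrowed from common not-recently-used schemes, and the class and method names are hypothetical.

```python
class AccessBitSet:
    """One cache set with a per-way access bit (illustrative model only)."""

    def __init__(self, ways):
        self.tags = [None] * ways
        self.access = [False] * ways

    def _touch(self, way):
        self.access[way] = True
        # Assumption: once all bits are set, keep only the most recent one,
        # so that a victim with a cleared bit always exists.
        if all(self.access):
            self.access = [False] * len(self.access)
            self.access[way] = True

    def victim(self):
        """Lowest-numbered way whose access bit is off (way 0 as a fallback)."""
        for way, used in enumerate(self.access):
            if not used:
                return way
        return 0

    def lookup(self, tag):
        """Return True on a hit; on a miss, replace the victim and return False."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        way = self.victim()
        self.tags[way] = tag
        self._touch(way)
        return False


# Usage with a hypothetical 4-way set and integer tags.
cache_set = AccessBitSet(4)
for tag in (1, 2, 3, 4, 5, 2, 6):
    cache_set.lookup(tag)
print(cache_set.tags)
```

Choosing the victim needs only a priority scan over the access bits, which is why the caption describes the scheme as easy to compute.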

