United Business Media EE Times



Memory Hierarchies & the Speed Gap

A workstation's cache memory subsystem design can restrain or speed up system performance.

by Steve Goldstein


Since the 1960s, microprocessor speeds have doubled every 18 months. This performance growth can be attributed to a combination of faster technology, better hardware design, and smarter compilers. As a result of this increased performance, powerful desktop systems now present opportunities for expanded use in a variety of compute-intensive applications. However, at the same time, system designers face a serious challenge: maintaining a system balance that allows the processor's potential to be realized.

To maintain this balance while keeping up with increasing microprocessor speeds, a system designer must determine how the memory subsystem is going to support a broad set of applications, and how it will feed data-hungry processors.

In the 1970s, mainframe designers faced a similar set of problems, with respect to disk I/O. Processor speeds were increasing. Disk-drive technology, on the other hand, was rapidly expanding storage capacity, but not speed. To improve disk performance, electronic data caching was introduced. This was done with main memory caching strategies in both Unix and the proprietary operating systems. In addition, outboard caching was added to SCSI devices and mainframe disk controllers.

In essence, these disk-caching strategies contributed to a balanced system because processors were seldom left waiting for I/O. However, current high-performance processors frequently stall waiting for data from main memory. Improvements in DRAM technologies, like SDRAM and EDO, have increased data rates, while DRAM page mode has reduced latency. These technological changes still do not eliminate the performance gap.

Three-tiered memory hierarchy
Virtually all state-of-the-art microprocessors include SRAM off-chip caches. Typically, the raw latency for an SRAM is 8 to 10 ns. However, current high-performance superscalar processors run at speeds in excess of 100 MHz and are capable of issuing four instructions per cycle.

If such a system is going to run at maximum speed, it needs an instruction stream of four instructions per cycle, plus an associated data stream. Instructions frequently come from sequential memory locations, so only two or three physical fetches may be needed per cycle. This level of demand is far beyond the capability of DRAMs that are busy for 60 ns per request. It is somewhat beyond the capability of SRAMs that have a latency, including overhead, of about 30 ns.
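The arithmetic behind this claim can be checked with a short sketch. The 100 MHz clock, the two or three fetches per cycle, and the 60 ns and 30 ns figures come from the text; the function names, and the simplifying assumption that each memory serves one request at a time, are illustrative only.

```python
def cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a clock given in MHz."""
    return 1000.0 / clock_mhz


def fetches_demanded(busy_ns, clock_mhz, fetches_per_cycle):
    """Fetches the processor would like to issue while one request is in flight.

    Assumes (for illustration) that the memory serves one request at a time.
    """
    return busy_ns / cycle_time_ns(clock_mhz) * fetches_per_cycle


for name, busy_ns in (("DRAM", 60.0), ("SRAM", 30.0)):
    low = fetches_demanded(busy_ns, 100.0, 2)
    high = fetches_demanded(busy_ns, 100.0, 3)
    print(f"{name}: {low:.0f} to {high:.0f} fetches demanded per request served")
```

Under these assumptions, a single memory bank falls behind by roughly an order of magnitude for DRAM and by a smaller factor for SRAM, which is the gap the hierarchy is meant to close.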

The solution is to build a three-tiered memory hierarchy with main memory DRAMs, off-chip cache SRAMs, and on-chip caching (see Figure 1).

Design issues with microprocessor caches
Within this general three-tiered hierarchical structure there is ample room for innovation. Products coming onto the market have adopted a wide range of solutions:

How big should a cache be? For performance, the bigger the capacity, the higher the hit rate. Other considerations being equal, higher hit rates yield better performance. Of course, there are other considerations, with cost and physical constraints usually being the most important.

How should cache capacity be divided into slots? Caches are structured as a collection of equally sized cache slots. 1K slots of 128 bytes each will have the same capacity as 4K slots of 32 bytes each.

Cache miss-rates defined
There is no standard definition of cache miss rate. Some authors use the ratio of cache misses to cache references. Others use the ratio of cache misses to instructions executed. While the ratio of misses to references seems intuitive, it leads to complications in multi-level cache hierarchies. Generally, it is useful to have a straightforward conversion from miss rate to the probability of an instruction waiting for a cache to deliver data. For the purposes of this article, the miss rate will be defined as misses per instruction executed.

Whatever definition is used for miss rate, hit rate is always defined as

hit rate = 1 - miss rate
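The two miss-rate definitions can be related with a small sketch. The conversion assumes that the average number of cache references per instruction is known; that quantity, the sample numbers, and the function names are hypothetical and are not taken from any measured system.

```python
def misses_per_instruction(misses_per_reference, references_per_instruction):
    """Convert a per-reference miss rate into misses per instruction executed."""
    return misses_per_reference * references_per_instruction


def hit_rate(miss_rate):
    """Hit rate under whichever miss-rate definition is in use."""
    return 1.0 - miss_rate


# Hypothetical workload: 2% of references miss, 1.3 references per instruction
# (one instruction fetch plus 0.3 data references on average).
per_instruction = misses_per_instruction(0.02, 1.3)
print(f"misses per instruction: {per_instruction:.3f}")
print(f"hit rate: {hit_rate(per_instruction):.3f}")
```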

Which line should be replaced? Whenever a cache reference results in a miss, a new line is brought into cache. Bringing in a new line generally implies evicting a valid line that is already in cache.

Should data and instructions share a single unified cache? There are two types of cache: unified and separate. With separate caches, data and instructions are kept in physically separate structures. The system in Figure 1 has separate on-chip caches and a unified off-chip cache.

Discussion of Cache Tradeoffs
The design issues raised above have been addressed in contemporary systems.

Cache size
As caches get large, the incremental value of bigger caches diminishes. This is an example of the law of diminishing returns. Performance models and measurements almost invariably track the hit rate (see Figure 2).

The most efficient approach to finding the right capacity is to start from a hit ratio curve developed from the target applications for the machine. If the design will have separate caches, then a pair of curves should be developed. In Figure 2, the knee of the curve is about 1/3 of the way along the "cache size" axis. As the initial capacity design point, select the smallest power of 2 that covers the knee; then look at tradeoffs around the initial point.
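Picking the starting point is a simple rounding step, sketched below. The knee location is whatever the hit ratio curve for the target applications suggests; the 200-Kbyte value used here is purely hypothetical, as is the function name.

```python
def smallest_power_of_two_covering(knee_bytes):
    """Return the smallest power of 2 that is at least knee_bytes."""
    if knee_bytes < 1:
        raise ValueError("knee_bytes must be at least 1")
    size = 1
    while size < knee_bytes:
        size *= 2
    return size


# Hypothetical knee read off a hit ratio curve: about 200 Kbytes.
start = smallest_power_of_two_covering(200 * 1024)
print(f"initial design point: {start // 1024} Kbytes")
```

The neighboring powers of 2 (half and double the result) are then the natural candidates for the tradeoff study around the initial point.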

The factors that determine cache size are different between on-chip and off-chip caches. On-chip caches are integral to processor design. Their size is usually constrained by physical considerations. The designer is really trading against total chip area and other optimizations such as deeper queues or an additional functional unit--for example, another integer adder. These are difficult decisions, and they are often made with the help of thousands of detailed simulation runs.

The size tradeoff for off-chip caches is usually based on a more straightforward cost-versus-performance evaluation. A larger cache has a better hit rate but a higher cost. Frequently, selection of the off-chip cache size can be postponed until very late in the design cycle. In fact, some systems allow the user to determine the size of the off-chip caches as part of the system configuration, with several cache-size options available. The cost-versus-performance tradeoff is then transferred to the customer.

While selection of the specific off-chip cache size may sometimes be postponed, all the parameters of its operation must be determined during design. The control logic and algorithms must be determined even if total capacity is left open. Experience shows that for a fairly wide range of cache sizes, the same set of optimizations is usually correct (or close enough). Hence, leaving the off-chip cache size open is not a trap. However, like on-chip caches, off-chip cache options may be limited by packaging, heat, or power budgets.

Design of the HALstation 300 series systems employed traces of instruction and data address references from several popular benchmarks. The off-chip cache size totaled 256 Kbytes. This cache size is smaller than on some other systems in its performance class. A practical constraint for HALstation 300 systems involved the decision to package our chip set on a multi-chip module (MCM). While the MCM imposed a physical space constraint, it afforded the opportunity to achieve unusually low latency to the off-chip cache. We minimized the product of miss rate and latency.

Cache line-size selection
Caches are structured as a collection of equally sized cache slots. One line of cache data fits into each slot. To accommodate addressing, both the size of the line and the number of slots are always a power of 2. The cache capacity, C, is the number of bytes in the cache. If the number of bytes per line is denoted by L and the number of slots by s, then C = s x L. Of course, if s and L are powers of 2, C must also be a power of 2. For a given cache size, there is a tradeoff to be made between the line size and the number of slots.
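The relation C = s x L can be made concrete by listing the ways a fixed capacity divides into slots and lines. This is a minimal sketch; the range of line sizes considered (16 to 256 bytes) and the function name are assumptions made for illustration.

```python
def slot_line_splits(capacity_bytes, min_line=16, max_line=256):
    """List (slots, line_bytes) pairs with slots * line_bytes == capacity_bytes.

    capacity_bytes and min_line are assumed to be powers of 2.
    """
    if capacity_bytes <= 0 or capacity_bytes & (capacity_bytes - 1):
        raise ValueError("capacity must be a power of 2")
    splits = []
    line = min_line
    while line <= max_line and line <= capacity_bytes:
        splits.append((capacity_bytes // line, line))
        line *= 2
    return splits


for slots, line in slot_line_splits(128 * 1024):
    print(f"{slots:5d} slots x {line:3d} bytes = {slots * line // 1024} Kbytes")
```

For a 128-Kbyte cache the list includes both 1K slots of 128 bytes and 4K slots of 32 bytes, the equal-capacity pair mentioned earlier.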

One mechanism for making valid tradeoffs is to compare the performance of designs of roughly equal cost. For caches, cost equates closely with capacity.

Cache performance has two important aspects. First, the required data must be in the cache--this is measured by the cache hit rate or miss rate. Second, a number of cycles is required to transfer data to or from the cache--this is measured by latency. While cache line size might affect latency, it primarily affects the cache miss rate (see "Cache miss-rates defined").

For a fixed cache size, many studies have shown that miss rate versus line size tends to have a shallow convex shape. Several fixed-capacity caches viewed together create a family of curves (see Figure 3). When the number of bytes per slot increases, the number of slots decreases. With fewer independent lines in cache, references to data items or code strings start to miss more frequently. On the other hand, both programs and data generally display a property known as locality of reference. Reference to any byte of data or text enhances the likelihood that bytes with nearby addresses will also be referenced soon. As cache lines get small (say, less than 16 bytes), miss rates tend to rise because of the loss of locality hits.

Finally, for a given line size, greater capacity (that is, more slots) yields a lower miss rate. Therefore, the curves do not intersect.

Figure 1. This three-tiered hierarchy is quite common in today's systems.

Remember the following rule of thumb: doubling the capacity divides the miss rate by the square root of 2.
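The rule of thumb can be written as a one-line scaling function. Dividing by the square root of 2 for each doubling is the same as multiplying by the square root of the capacity ratio; the baseline miss rate and the capacities below are hypothetical, chosen only to show the shape of the rule.

```python
import math


def scaled_miss_rate(base_miss_rate, base_capacity, new_capacity):
    """Apply the rule of thumb: each doubling of capacity divides the
    miss rate by sqrt(2), i.e. miss rate scales as 1 / sqrt(capacity)."""
    return base_miss_rate * math.sqrt(base_capacity / new_capacity)


# Hypothetical baseline: 0.04 misses per instruction at 64 Kbytes.
for kbytes in (64, 128, 256, 512):
    rate = scaled_miss_rate(0.04, 64, kbytes)
    print(f"{kbytes:4d} Kbytes: {rate:.4f} misses per instruction")
```

Note that under this rule it takes a fourfold increase in capacity to halve the miss rate, which is one way to see the diminishing returns discussed above.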

Cache lines are not free. Each slot needs an overhead of several bytes. A part of each line's address, called its "tag," is stored in a separate tag memory. In addition, several bits provide information regarding the state of a slot's contents, such as whether the slot holds a currently valid line, whether or not it's "dirty," etc. These overhead factors tend to bias the optimum toward longer lines. On the other hand, longer lines use more bus bandwidth per move, which may be an important consideration in systems where bus bandwidth is at a premium.
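A rough sketch of the overhead calculation shows why it favors longer lines. The 32-bit address, the direct-mapped organization, and the two state bits per slot (valid and dirty) are all assumptions made for illustration; they do not describe any particular product.

```python
def overhead_bytes(capacity_bytes, line_bytes, address_bits=32, state_bits=2):
    """Tag-plus-state overhead, in bytes, for a direct-mapped cache.

    capacity_bytes and line_bytes are assumed to be powers of 2.
    """
    slots = capacity_bytes // line_bytes
    index_bits = slots.bit_length() - 1        # log2(slots)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    tag_bits = address_bits - index_bits - offset_bits
    return slots * (tag_bits + state_bits) / 8


for line in (32, 64, 128):
    total = overhead_bytes(256 * 1024, line)
    print(f"{line:3d}-byte lines: {total:.0f} bytes of tag and state")
```

With capacity held fixed, the tag width per slot does not change in this simple model, so halving the number of slots halves the total overhead.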

In the HALstation 300 systems, we selected a line size of 128 bytes for the off-chip cache. This is one step longer than the current norm of 64 bytes. The decision was influenced by a desire to take full advantage of locality of reference and a knowledge that robust memory bandwidth would reduce any undesirable side effects.


Line replacement
A simple algorithm for line replacement is least recently used (LRU), which replaces the line with the least recent activity. However, LRU requires that a new line can replace any arbitrary line; hence, it can go into any slot. This capability is referred to as fully-associative mapping because a line may be associated with any slot. The drawback of LRU with fully-associative mapping is that, to find the referenced line, it may be necessary to search every slot. With current technology this is not practical.

Figure 2. Cost increases linearly with size, while hit rate is bounded by 1.0. For a specific application, a jump in the hit rate is usually an artifact of capturing an array or the full text for an application. Broader measurements yield smooth curves. Also, note that hit ratio is used as a surrogate for performance. Hit ratio is a good indicator of performance, and it is relatively easy to compute from models driven by instruction traces.

At the other end of the spectrum, a line may be restricted to a unique slot in cache. The slot is computed from the (original) memory address--a process called direct mapping. There are a number of alternatives between these two extreme points, which offer a rich and complex set of tradeoffs.
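Direct mapping can be sketched with integer arithmetic on the address. The simplest slot computation, assumed here, takes the line number modulo the number of slots; real designs may use a different function of the address. The function names and the sample geometry (1K slots of 128 bytes) are illustrative.

```python
def direct_mapped_slot(address, line_bytes, num_slots):
    """Slot index for an address: line number modulo the number of slots."""
    return (address // line_bytes) % num_slots


def tag_of(address, line_bytes, num_slots):
    """Remaining high-order address bits, stored to identify the line."""
    return address // (line_bytes * num_slots)


LINE_BYTES, NUM_SLOTS = 128, 1024  # hypothetical 128-Kbyte cache

for address in (0x00000, 0x00080, 0x20000):
    slot = direct_mapped_slot(address, LINE_BYTES, NUM_SLOTS)
    tag = tag_of(address, LINE_BYTES, NUM_SLOTS)
    print(f"address {address:#07x} -> slot {slot}, tag {tag}")
```

Addresses 0x00000 and 0x20000 land in the same slot with different tags, so in a direct-mapped cache each one evicts the other.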

One common tradeoff uses n-way set-associative caches. In this structure, the slots are grouped into sets of n, and a cache line can reside in any of the n slots of the set selected by its address. This property guarantees that any arbitrary combination of n lines can coexist in cache without conflict. Depending on code and data layout, this property can significantly lower miss rates compared to direct-mapped caches of the same capacity. In general, for n-way associativity, values of n larger than eight contribute little to reducing miss rates. In practice, caches are either direct-mapped or two-way or four-way set-associative.
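The effect of associativity on conflict misses can be seen in a toy simulation. This is a minimal sketch, not a model of any real cache: the tiny geometry, the modulo set selection, the LRU choice within a set, and the two-address trace are all assumptions picked to expose the conflict.

```python
from collections import OrderedDict


def count_misses(addresses, line_bytes, num_slots, ways):
    """Count misses for an n-way set-associative cache (ways=1 is direct-mapped).

    Each set is an OrderedDict kept in least-to-most recently used order.
    """
    num_sets = num_slots // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        line_number = address // line_bytes
        lines = sets[line_number % num_sets]
        if line_number in lines:
            lines.move_to_end(line_number)   # hit: mark most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)    # set full: evict least recently used
            lines[line_number] = True
    return misses


# Two addresses that collide in an 8-slot cache with 32-byte lines.
trace = [0, 256] * 4
print("direct-mapped misses:", count_misses(trace, 32, 8, ways=1))
print("two-way misses:      ", count_misses(trace, 32, 8, ways=2))
```

In the direct-mapped case the two lines keep evicting each other; with two ways they coexist in one set and only the first reference to each misses.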

To retrieve data from an n-way associative cache and minimize latency, all n members of the set and their identifying tags are read in parallel. On a hit, exactly one tag will match the referenced address. Its corresponding data is selected. If none of the tags match, a miss has occurred.

If the parallel readout-and-compare hardware cannot be justified, the best choice is usually a direct-mapped cache. In a direct-mapped cache, the same operations occur, but there is only one cache slot to read and only one tag to fetch and check. It is either a hit or a miss.

When misses occur in a direct-mapped cache, the new line must be inserted into a particular slot determined by a hash of its address. However, with an n-way set-associative organization, there are n slots to choose from. Here an LRU algorithm among the n slots is generally used. Maintaining the LRU order requires that state information be maintained. For example, with a four-way-associative scheme, there are four factorial or 24 possible LRU sequences. A full implementation would require 5 bits of state information per set of four slots. HAL saves space by using a 4-bit approximation to the exact state.
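The state requirement follows from counting orderings: n slots can be ranked in n factorial ways, and encoding one of them exactly takes the base-2 logarithm of that count, rounded up. The sketch below only reproduces this counting argument; it says nothing about how any approximation is encoded, and the function names are illustrative.

```python
import math


def lru_orderings(ways):
    """Number of distinct recency orderings of the slots in one set."""
    return math.factorial(ways)


def exact_lru_bits(ways):
    """Bits needed to encode any one ordering: ceil(log2(ways!))."""
    return (lru_orderings(ways) - 1).bit_length()


for ways in (2, 4, 8):
    print(f"{ways}-way: {lru_orderings(ways)} orderings, "
          f"{exact_lru_bits(ways)} bits per set")
```

For four ways this gives 24 orderings and 5 bits, matching the figures above.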

No cache, however large or cleverly managed, can eliminate all misses. There are always some first references to data that result in "compulsory" misses. But even compulsory misses can be reduced by longer lines and by prefetching data before it is actually referenced. The HALstation 300 employs both techniques. A prefetch instruction unique to the SPARC V9 architecture allows software to do anticipatory prefetch of data.

Separate versus unified cache
In a unified cache, both instructions and data are generally treated identically, and they contend for the same pool of slots. In a system with multiple levels in the memory hierarchy, it is not uncommon to have separated on-chip caches and unified off-chip caches.

Figure 3. A minimum naturally occurs as bytes per slot increase.

There are strong reasons to have separate instruction and data caches on-chip. Data caches are positioned close to register structures and functional units, such as the integer and floating point units, to minimize latency and to simplify routing. For the same reasons, instruction caches (i-caches) are positioned in proximity to the instruction decode and instruction-issue logic. Also, instructions may be stored on-chip in a special pre-decoded or partially decoded format that changes the instruction size. Few current commercial microprocessors have unified on-chip caches.

A unified cache allows dynamic tradeoffs of capacity between programs and data, potentially increasing overall hit rates. On the other hand, there is a greater opportunity for sequential arrays of data to roll through and conflict with a relatively stable inner loop of code.

In the HALstation 300 series, it was decided to have separate, equally sized off-chip caches. The goal was to transfer both instructions and data between caches and the processor simultaneously. This is important for a speculative superscalar processor. While simultaneous transfer capability is possible with a unified cache, it unnecessarily complicates the design.

There are numerous challenges in keeping up with the exponential increase in processor speeds. Effectively dealing with the memory speed gap is among the top challenges. Unfortunately, there is no single technology that can supply acceptable performance and cost.

Even if there are large improvements in DRAM technology, for the foreseeable future, an integrated, multi-level memory hierarchy will be best.

Steve Goldstein is the director of strategic planning for Fujitsu's HAL Computer Systems (Campbell, CA).

To voice an opinion on this or any Integrated System Design article, please e-mail your message to michael@asic.com.


integrated system design  October 1996






Copyright © 1996 Integrated System Design Magazine


All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.