United Business Media EE Times



ASIC Design

Evaluating ASIC Memory Trade-offs

On-chip memory and a small cache can increase performance when you're using an embedded RISC core.

by Mark Buchanan



Stand-alone microprocessors and microcontrollers often include an on-chip cache to improve overall performance. RISC processors access memory nearly every cycle, either to fetch an instruction or to perform a load or store, so their performance can improve dramatically when a cache lets most of these accesses complete in a single cycle. The downside is that a cache leads to unpredictability: interrupt latency may vary considerably, depending on the contents of the cache when an interrupt occurs. Customizing a circuit's memory configuration to fit the needs of the application is therefore critical.

The availability of RISC cores that can be embedded in an ASIC makes it possible to customize memory configurations. The chip designer can use an on-chip cache in the same way as a standard microcontroller chip or use a regular on-chip memory (SRAM or DRAM) to store interrupt handlers and speed-critical routines. A combination of these two approaches is yet another option. Once the designer has chosen a method, further decisions follow: How big should the cache or on-chip memory be? What organization of the cache offers satisfactory performance without excessive die size overhead? What is the best ratio between cache and regular on-chip memory? Should system performance be predictable (deterministic)?

To address some of these questions, we created an example application and evaluated various memory configurations. We used an instruction set simulator and a cache simulator to measure performance.

Our test results show that a combination of on-chip memory and a small cache can increase overall circuit performance. What's more, the on-chip memory and small cache combination can be less expensive to implement.

No 'typical' application

It's important to pick a "typical" application when evaluating memory trade-offs. Unfortunately, there is no such thing as a typical application. By evaluating the memory trade-offs with our example application, however, we sought insights that will apply to other applications as well.

The example application runs on an ARM7TDMI-based system. It incorporates a multitasking embedded operating system (µC/OS), 50 tasks, an interrupt handler, and a task triggered by the interrupt handler. No hardware is required for it to run in the ARMulator instruction set simulator. A "trickbox" model was used with the ARMulator to generate interrupts. Messages are passed between the application's 50 tasks in a continuous loop. Each task modifies the message slightly, calls a common output routine, and passes it to the next task. By writing to a memory location in the trickbox, a task can cause an interrupt, at which point the interrupt handler sets a semaphore that will cause a specific IRQ task to execute after the handler finishes. The IRQ task waits to receive the semaphore, counts the interrupt, and determines when to stop based on the number of interrupts received.

The code was compiled without optimization so that none of it would be lost. The total size of the application is as follows: code and constants occupy 17,600 bytes, and read/write and zero-initialized data occupy 112,272 bytes. (Much of the zero-initialized data is allocated to task stacks, which are only lightly used.)

When a specific number of interrupts have been generated and serviced by the IRQ handler and IRQ task, the application terminates. You can easily vary the number of interrupts the application generates to simulate different levels of interrupt activity. For the experiment, we used three interrupt rates and assumed a processor frequency of 40 MHz.

  1. Low interrupt rate: Every fifth task generates an interrupt, for a total of 100 interrupts and 508,151 total cycles. Based on zero wait state cycles, these figures correspond to a rate of approximately one interrupt every 127 µs.
  2. Medium interrupt rate: Tasks 2, 5, 7, 10, 12, 15, 17, 20, 22, 25, 27, 30, 32, 35, 37, 40, 42, 45, 47, and 50 generate a total of 200 interrupts and 611,761 total cycles, for a rate of approximately one interrupt every 76 µs.
  3. High interrupt rate: Every task generates an interrupt, resulting in a total of 500 interrupts and 922,401 total cycles, for a rate of approximately one interrupt every 46 µs.

Since the number of wait states will vary depending on the memory configuration, the interrupt rates listed above are approximate. For each interrupt rate, the same number of interrupts and CPU cycles will occur for each memory configuration. (Although we assumed a processor frequency of 40 MHz, it is not uncommon for embedded processor cores to run at much higher frequencies.)
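The interrupt intervals quoted above follow directly from the cycle counts and the assumed 40-MHz clock. The short Python sketch below reproduces that arithmetic; the function and variable names are ours, chosen for illustration, and the cycle counts are the zero-wait-state figures listed for each interrupt rate.

```python
CLOCK_HZ = 40_000_000  # processor frequency assumed in the experiment

# (interrupts, total zero-wait-state cycles) for each interrupt rate
RUNS = {
    "low": (100, 508_151),
    "medium": (200, 611_761),
    "high": (500, 922_401),
}


def interrupt_period_us(interrupts, total_cycles, clock_hz=CLOCK_HZ):
    """Average time between interrupts, in microseconds."""
    cycles_per_interrupt = total_cycles / interrupts
    return cycles_per_interrupt / clock_hz * 1e6


periods = {name: interrupt_period_us(*run) for name, run in RUNS.items()}
for name, period in periods.items():
    print(f"{name:6s} {period:6.1f} us between interrupts")
```

At 40 MHz one cycle takes 25 ns, so roughly 5,080 cycles per interrupt in the low-rate case works out to about 127 µs.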

Table 1
Memory option hit rates, valid-bit-per-word cache architecture (%)

Interrupt   1-kbyte   2-kbyte   4-kbyte   6-kbyte        1-kbyte cache,    2-kbyte cache,
rate        cache     cache     cache     on-chip SRAM   6-kbyte on-chip   6-kbyte on-chip
                                                         SRAM              SRAM
Low         42.15     62.75     77.68     55.93          77.82             85.94
Medium      39.74     58.98     78.01     59.66          79.54             87.85
High        36.83     54.47     80.48     65.82          83.30             90.26

The ARMulator generated a memory access trace for each case for use as input to the Symbios Wincache cache simulator. Wincache analyzes the trace file based on a user-specified memory configuration, which can include caches of various types and sizes and regions of on-chip memory.

Memory configurations

We evaluated the following memory configurations:

  1. a 1-kbyte mixed instruction and data cache
  2. a 2-kbyte mixed instruction and data cache
  3. a 4-kbyte mixed instruction and data cache
  4. no cache and 6 kbytes of on-chip memory containing the interrupt vector table, interrupt handler, IRQ task and key interrupt-related routines, and data from the kernel
  5. a 1-kbyte mixed instruction and data cache plus 6 kbytes of on-chip memory containing the code and data listed in configuration 4
  6. a 2-kbyte mixed instruction and data cache, plus 6 kbytes of on-chip memory containing the code and data listed in configuration 4

An important point is that a cache requires roughly three times as much die area as the equivalent amount of simple SRAM memory. In all of the test configurations, we used a four-way set-associative cache. Each cache line consisted of four 32-bit words with a valid bit per word.
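To make the cache organization concrete, the following Python sketch works out the geometry of a four-way set-associative cache with four-word (16-byte) lines and shows how an address would split into tag, set index, and word offset. It is an illustration under stated assumptions rather than a description of the actual hardware: it assumes 32-bit addresses and power-of-two cache sizes, and the function names are hypothetical.

```python
WORD_BYTES = 4                      # 32-bit words
WORDS_PER_LINE = 4                  # four words per cache line
LINE_BYTES = WORD_BYTES * WORDS_PER_LINE
WAYS = 4                            # four-way set-associative


def cache_geometry(cache_bytes, address_bits=32):
    """Derive line, set, and bit-field counts (assumes power-of-two sizes)."""
    lines = cache_bytes // LINE_BYTES
    sets = lines // WAYS
    offset_bits = LINE_BYTES.bit_length() - 1   # log2(16) = 4
    index_bits = sets.bit_length() - 1          # log2(sets)
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
        # one valid bit per word versus one per line
        "valid_bits_per_word_design": lines * WORDS_PER_LINE,
        "valid_bits_per_line_design": lines,
    }


def split_address(addr, cache_bytes):
    """Return (tag, set index, word offset within the line) for an address."""
    g = cache_geometry(cache_bytes)
    word = (addr >> 2) & (WORDS_PER_LINE - 1)
    index = (addr >> g["offset_bits"]) & (g["sets"] - 1)
    tag = addr >> (g["offset_bits"] + g["index_bits"])
    return tag, index, word


for kbytes in (1, 2, 4):
    print(kbytes, "kbyte:", cache_geometry(kbytes * 1024))
```

Under these assumptions a 2-kbyte cache holds 128 lines in 32 sets, and the valid-bit-per-word design needs 512 valid bits instead of 128.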

Results

Table 1 lists the effective hit rates for the memory options we tested. (Any access to the on-chip memory is counted as a hit.)

Performance can be described as the "effective processor frequency." Figure 1 shows the effective processor frequencies of the various memory configurations we tested. Each miss requires one wait state (which takes two clock cycles), and hits have no wait states. The calculated frequencies include the effect of internal cycles, which are unaffected by the memory access speed. If the processor runs at a higher frequency, the results will scale, except that more wait states may be required in the case of a cache or SRAM miss. More wait states would increase the spread of the results (that is, the differences in performance between configurations would increase).
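One way to picture the effective processor frequency is as the clock rate scaled by the share of cycles that are not spent waiting on memory. The Python sketch below expresses that idea. It is a simplified model, not the calculation behind Figure 1: the parameter accesses_per_cycle (memory accesses per zero-wait-state cycle, with the remainder being internal cycles) is a hypothetical input that the article does not report, and the default of two stall cycles per miss is our reading of the wait-state description above.

```python
def effective_frequency_mhz(clock_mhz, hit_rate, accesses_per_cycle,
                            wait_cycles_per_miss=2):
    """Clock rate scaled by the fraction of time not spent stalled.

    hit_rate             -- fraction of memory accesses that hit (0.0 to 1.0)
    accesses_per_cycle   -- ASSUMED share of zero-wait-state cycles that
                            access memory; internal cycles never stall
    wait_cycles_per_miss -- ASSUMED extra cycles added by each miss
    """
    miss_rate = 1.0 - hit_rate
    stall_per_cycle = accesses_per_cycle * miss_rate * wait_cycles_per_miss
    return clock_mhz / (1.0 + stall_per_cycle)


# Low-interrupt-rate hit rates from Table 1, with an assumed (not measured)
# 0.8 memory accesses per cycle. The printed values are illustrative only.
table1_low = {
    "1-kbyte cache": 0.4215,
    "4-kbyte cache": 0.7768,
    "2-kbyte cache + 6-kbyte SRAM": 0.8594,
}
for name, hit_rate in table1_low.items():
    mhz = effective_frequency_mhz(40.0, hit_rate, accesses_per_cycle=0.8)
    print(f"{name:30s} {mhz:5.1f} MHz (illustrative)")
```

Raising wait_cycles_per_miss in this model widens the gap between configurations, which matches the observation that more wait states increase the spread of the results.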


Figure 1. The memory configuration affects system performance. For the example application, a mix of a small cache and on-chip SRAM provides the best results.

The effect of cache architecture

If we had used a cache that was organized differently (two-way or direct-mapped) or a different cache line size, the results would differ. The valid-bit-per-word architecture will have a lower hit rate than a cache that uses a single valid bit for a cache line (valid-bit-per-line architecture). Whereas a valid-bit-per-line architecture loads an entire cache line (four 32-bit words) when a miss occurs, the valid-bit-per-word design loads only the requested word, so each of the other words in the line misses on its first access and is loaded individually. Although the valid-bit-per-line architecture will have fewer cache misses, the number of wait cycles required when a miss does occur is higher, which ultimately decreases the effective processor frequency.

Table 2 shows the cache hit rate for the example application when a valid-bit-per-line architecture is used. The hit rate is substantially higher than with the valid-bit-per-word architecture (see Table 1 again). (The column for 6 kbytes of on-chip SRAM is supplied for completeness, although the results are the same, since there's no cache.)

The effect of the valid-bit-per-line architecture on performance is contrary to what the hit rate suggests. If a cache miss holds the processor for eight cycles to perform a cache-line fill of four 32-bit words, the effective processor frequency would decrease (see Figure 2).

In all cases, the valid-bit-per-word architecture offers better performance, even though the hit rate is lower. The die area required for either cache architecture is about the same. Although extra valid bits are necessary for the valid-bit-per-word architecture, the cache doesn't have to generate addresses as it does in the valid-bit-per-line architecture. Another benefit from the valid-bit-per-word approach is that performance is more deterministic and interrupt latencies can be shorter, since it eliminates the long delay created by refilling a complete cache line.
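The gap between hit rate and performance can be reproduced with a toy cache model. The Python sketch below is not the Wincache simulator and does not use the ARMulator trace; it is a minimal four-way set-associative model run on a synthetic trace. Least-recently-used replacement is an assumption (the replacement policy is not stated), the two-cycle cost of a single-word miss is our reading of the wait-state figure given earlier, and the eight-cycle line fill is the figure quoted above.

```python
from collections import OrderedDict

WORDS_PER_LINE = 4          # four 32-bit words per line
WORD_MISS_CYCLES = 2        # ASSUMED stall for loading one word
LINE_FILL_CYCLES = 8        # stall for a full four-word line fill


class ToyCache:
    """Four-way set-associative cache with LRU replacement (assumed)."""

    def __init__(self, size_bytes, ways=4, per_word_valid=True):
        self.ways = ways
        self.per_word_valid = per_word_valid
        self.num_sets = size_bytes // (WORDS_PER_LINE * 4) // ways
        # One OrderedDict per set: tag -> set of valid word offsets.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        word = (addr >> 2) % WORDS_PER_LINE
        line_no = addr >> 4
        index, tag = line_no % self.num_sets, line_no // self.num_sets
        lines = self.sets[index]
        valid = lines.get(tag)
        if valid is not None and word in valid:
            lines.move_to_end(tag)          # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if valid is None:
            if len(lines) >= self.ways:
                lines.popitem(last=False)   # evict least recently used line
            valid = lines[tag] = set()
        lines.move_to_end(tag)
        if self.per_word_valid:
            valid.add(word)                         # load one word only
        else:
            valid.update(range(WORDS_PER_LINE))     # fill the whole line
        return False


def run(trace, per_word_valid, size_bytes=1024):
    """Return (hit rate, stall cycles) for a list of byte addresses."""
    cache = ToyCache(size_bytes, per_word_valid=per_word_valid)
    for addr in trace:
        cache.access(addr)
    cost = WORD_MISS_CYCLES if per_word_valid else LINE_FILL_CYCLES
    return cache.hits / len(trace), cache.misses * cost


# Synthetic trace: touch only the first two words of eight different lines,
# as short, branchy code might.
trace = [line * 16 + word * 4 for line in range(8) for word in (0, 1)]
print("per-word valid:", run(trace, per_word_valid=True))
print("per-line valid:", run(trace, per_word_valid=False))
```

On this trace the per-line design hits half the time while the per-word design never hits, yet the per-line design stalls twice as long, because each line fill fetches two words that are never used. When every word of a line is used, the two designs stall for the same number of cycles in this model.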

Smaller can be better

Our test results illustrate how the combination of on-chip memory and a small cache can increase overall circuit performance. Because the interrupt-related code is available in on-chip memory, the interrupt latency is always consistent and as fast as possible. Also, the on-chip memory and small cache combination can be less expensive to implement. A 4-kbyte cache would require approximately the same area as 12 kbytes of on-chip SRAM because of the control logic and tag RAM that are needed. The 4-kbyte cache and the configuration of a 2-kbyte cache with a 6-kbyte SRAM are roughly equal in die area, but the latter offers both better overall performance and predictably fast interrupt response. On-chip DRAM, which occupies even less area, may be another option.

Table 2
Memory option hit rates, valid-bit-per-line cache architecture (%)

Interrupt   1-kbyte   2-kbyte   4-kbyte   6-kbyte        1-kbyte cache,    2-kbyte cache,
rate        cache     cache     cache     on-chip SRAM   6-kbyte on-chip   6-kbyte on-chip
                                                         SRAM              SRAM
Low         77.19     86.44     92.26     55.93          91.47             95.00
Medium      76.35     85.18     92.25     59.66          91.92             95.63
High        75.49     83.59     93.06     65.82          93.13             96.45

Another approach is the "read-and-release" cache, which has a valid-bit-per-line design but reads the requested word first and allows the processor to continue executing while filling the rest of the cache line. In most cases, you can achieve high hit rates without the large miss penalties. However, this type of cache requires more silicon area, and in some cases the processor can still be suspended while remaining words in the cache line are loaded (such as when a memory access doesn't follow the previous one sequentially).


Figure 2. The cache structure also affects system performance. A valid-bit-per-line structure reduces the effective frequency compared with a valid-bit-per-word structure.

Another possible choice is a "lock-down" cache, in which part of the cache acts as a simple SRAM. This scheme allows the designer to postpone deciding how much SRAM is necessary for interrupt routines and other critical code. However, a lock-down cache wastes die area, because part of the space it requires is used as simple SRAM that could otherwise be implemented much more cost-effectively. To match the benchmark results of the 2-kbyte cache, 6-kbyte SRAM configuration, you would need an 8-kbyte lock-down cache. Only 2 kbytes would be used as a cache, and the remaining 6 kbytes would operate as an expensive SRAM, taking up roughly three times the area of a simple SRAM.
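The area comparisons can be checked with a few lines of arithmetic. The Python sketch below applies the rule of thumb that a cache needs roughly three times the die area of the same amount of simple SRAM and expresses each configuration in kbytes of SRAM-equivalent area. The unit and the function name are ours, and the factor of three is an approximation, so the results are rough estimates only.

```python
CACHE_AREA_FACTOR = 3  # cache area is roughly 3x simple SRAM, per byte


def sram_equivalent_kbytes(cache_kbytes=0, sram_kbytes=0):
    """Approximate die area in kbytes of simple-SRAM-equivalent area."""
    return cache_kbytes * CACHE_AREA_FACTOR + sram_kbytes


configurations = {
    "4-kbyte cache": sram_equivalent_kbytes(cache_kbytes=4),
    "1-kbyte cache + 6-kbyte SRAM": sram_equivalent_kbytes(1, 6),
    "2-kbyte cache + 6-kbyte SRAM": sram_equivalent_kbytes(2, 6),
    "8-kbyte lock-down cache": sram_equivalent_kbytes(cache_kbytes=8),
}
for name, area in configurations.items():
    print(f"{name:30s} ~{area} kbytes of SRAM-equivalent area")
```

By this estimate the 8-kbyte lock-down cache costs about twice the area of the 2-kbyte cache with 6 kbytes of SRAM, even though both provide the same split between cached and locked storage.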

A big advantage of using a memory cache is that you can increase performance without knowing much about the application in advance. However, with a bit more code analysis, you can use an on-chip memory or a small cache in combination with an on-chip memory for a less expensive solution with better and more predictable real-time performance. One of the nice things about embedded processor cores is that they free you from the manufacturer's standard microprocessor cache offerings to choose the best memory configuration for your requirements.

Mark Buchanan is a design system architect at Symbios, Inc. (Fort Collins, Colo.).



integrated system design  February 1998



Copyright © 2000 Integrated System Design
