Design Article
Improving performance using SPI-DDR NOR flash memory
Qamrul Hasan and Cliff Zitlaw, Spansion
9/2/2011 11:15 AM EDT
Executing Code from Non-Volatile Memory
Systems that use an Execute-in-Place (XiP) approach access the non-volatile memory subsystem constantly to retrieve program code, so that subsystem can introduce bottlenecks into the primary execution path. Analyzing the efficiency of an XiP-based memory subsystem is not a simple calculation, as it is for systems that execute code from RAM. Depending on how a system is architected, many factors contribute to memory performance.
System performance is often measured in terms of the number of instructions per cycle (IPC) that the system can achieve. Consider a CPU that takes 4 cycles to execute an instruction: for this CPU, an IPC of 0.25 would be ideal. Many factors influence the IPC. For example, a cache miss stalls the system while an instruction is fetched from memory, resulting in a lower IPC.
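The relationship between stalls and IPC can be sketched with a simple model. The function below is a minimal illustration, not the authors' simulation: it assumes the miss rate is expressed per instruction and that every miss adds a fixed number of stall cycles. The name `effective_ipc` and the example values (a 1% miss rate, 30 stall cycles) are hypothetical.

```python
def effective_ipc(cycles_per_instruction, miss_rate, stall_cycles_per_miss):
    """Return IPC once average cache-miss stalls are added to the base cost.

    Assumptions (not from the article): miss_rate is misses per instruction,
    and each miss costs a fixed number of CPU stall cycles.
    """
    average_cycles = cycles_per_instruction + miss_rate * stall_cycles_per_miss
    return 1.0 / average_cycles


# The article's example CPU: 4 cycles per instruction, no stalls.
ideal = effective_ipc(4, 0.0, 0)

# Hypothetical stall figures, chosen only to show the direction of the effect.
stalled = effective_ipc(4, 0.01, 30)

print(f"ideal IPC:   {ideal:.4f}")
print(f"stalled IPC: {stalled:.4f}")
```

With no misses the model returns the ideal 0.25; any nonzero miss rate or stall time lowers the result.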
For XiP-based applications, where the program executes directly out of non-volatile memory, system performance depends on the ability of the memory subsystem to fill the cache whenever there is a cache miss. Because code tends to execute within a locality of reference, systems with level 1 and level 2 caches can achieve hit rates over 99%. When a cache miss does occur, the memory subsystem needs to fill the entire cache line as quickly as possible to maintain system performance. Several factors determine how quickly this can be accomplished:
Read Bandwidth: A high-bandwidth bus is needed to minimize the overall read latency, even though only a single cache line (typically 32 bytes) is being read. In addition, the nature of application code requires the ability to make small, fast memory accesses throughout the entire code region with minimum latency.
Read bandwidth varies across bus interfaces and operating frequencies and must be balanced against pin count. Figure 3 compares the performance of the different NOR bus interfaces. Consider SPI-DDR NOR with an initial access time of 120 ns: it significantly outperforms Page Mode NOR and, especially, Async NOR. Burst Mode NOR has the highest bandwidth, but this advantage over SPI-DDR is minimized in a cache-based system.
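One way to see why initial access time matters so much for short reads is to model the time to fill a single cache line. The sketch below is an illustration under stated assumptions, not a reproduction of Figure 3: the 120 ns initial access time and 32-byte line come from the text, while the 4-bit I/O width, two transfers per clock, and 100 MHz clock are assumed values, and the function names are hypothetical.

```python
def line_fill_time_ns(initial_access_ns, line_bytes, io_bits,
                      transfers_per_clock, bus_mhz):
    """Time to read one cache line: initial access plus the data transfer."""
    bytes_per_clock = io_bits * transfers_per_clock / 8
    clock_ns = 1000.0 / bus_mhz
    return initial_access_ns + (line_bytes / bytes_per_clock) * clock_ns


def effective_bandwidth_mb_s(fill_ns, line_bytes):
    """Bandwidth actually seen for one line fill (1 byte/ns equals 1000 MB/s)."""
    return line_bytes / fill_ns * 1000.0


# Assumed interface: 4 I/O lines, double data rate, 100 MHz clock.
spi_ddr_fill = line_fill_time_ns(120, 32, 4, 2, 100)
spi_ddr_bw = effective_bandwidth_mb_s(spi_ddr_fill, 32)

print(f"line fill: {spi_ddr_fill:.0f} ns, effective: {spi_ddr_bw:.1f} MB/s")
```

Under these assumptions the initial access is a sizable share of the total fill time, so the effective bandwidth for a 32-byte read sits well below the interface's peak rate.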

Controller Latency: Initiating a read command incurs controller latency from address and protocol overhead. A common way to measure controller latency is from the time the command is sent to the controller to the time the controller returns the first byte of data. Controller latency is higher for SPI-DDR NOR, especially at low operating frequencies, because command, address, and data are transferred serially. Figure 4 shows that SPI-DDR has a somewhat longer controller latency than the parallel NOR offerings. The lower performance is primarily due to the serialization of the command and address information required at the beginning of an SPI transaction. Note that the gap in performance closes significantly as the memory bus frequency increases. In many mobile and embedded systems, a sub-200 ns controller latency would provide adequate performance and allow SPI-DDR to be considered a viable alternative to Parallel NOR.

FIGURE 4: Controller Latency
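The frequency dependence of the serialization overhead can be sketched as a clock count divided by the bus frequency. This is a rough illustration only: the 8-bit command, 24-bit address, lane widths, and absence of dummy clocks are assumptions made for the example, not values from the article or from a particular device datasheet, and `serial_overhead_ns` is a hypothetical helper.

```python
import math


def serial_overhead_ns(bus_mhz, command_bits=8, address_bits=24,
                       command_bits_per_clock=1, address_bits_per_clock=8,
                       dummy_clocks=0):
    """Time spent shifting in command and address before data can return.

    All default field widths and per-clock rates are assumed for illustration.
    """
    clocks = (math.ceil(command_bits / command_bits_per_clock)
              + math.ceil(address_bits / address_bits_per_clock)
              + dummy_clocks)
    return clocks * 1000.0 / bus_mhz


for mhz in (50, 66, 100):
    print(f"{mhz:>3} MHz: {serial_overhead_ns(mhz):.0f} ns of serial overhead")
```

Because the clock count is fixed, doubling the bus frequency halves the overhead, which is consistent with the gap to parallel NOR narrowing at higher frequencies.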
Instant CPU Stall Time: When the next instruction to execute is not available in the cache, it must be loaded from memory. Figure 5 shows the impact of a cache miss when using a 100 MHz memory bus. The delay with Burst NOR, Page NOR, and SPI-DDR NOR ranges from 160 to 210 ns. The instant delay is worst for Async NOR: at over 330 ns, it could still be tolerable depending upon how often the cache misses. However, as Figure 5 also shows, every subsequent Async NOR instruction fetch experiences the 330 ns delay as well. For a cache line containing eight instructions, the actual Async NOR delay incurred is therefore about 2.6 µs (8 × 330 ns), which may adversely impact the user experience. SPI-DDR thus compares favorably to both Async and Page Mode products in performance and pin count. When SPI-DDR is compared to Burst Mode devices, a system developer will need to consider whether the application justifies the additional pins (30+) required for the higher-performance Burst Mode interface.
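The Async NOR figure follows directly from the numbers quoted above, as this short check shows. The function name is hypothetical, and the model simply assumes that each of the eight instructions in the line pays the full 330 ns access.

```python
def async_line_fill_ns(fetch_ns, instructions_per_line):
    """Total delay when every instruction fetch pays the full access time."""
    return fetch_ns * instructions_per_line


total_ns = async_line_fill_ns(330, 8)
total_us = total_ns / 1000

print(f"Async NOR line fill: {total_ns} ns ({total_us:.2f} us)")
```

The result, 2640 ns, is the roughly 2.6 µs cited in the text.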

Average CPU Stall Time: The impact of instant delay on system responsiveness depends upon how often the cache misses; if the miss rate is very low, the system can tolerate a relatively higher instant delay. Table 1 shows the average CPU stall time, measured in CPU clock cycles, as calculated for a 2% cache miss rate (i.e., 4 cache misses over 200 instructions). The impact of stall time on system performance depends upon the CPU clock frequency. As Table 1 shows, Burst NOR provides minimal stalling of the CPU, in the range of 1 or 2 clock cycles. For CPU operating frequencies from 100 MHz to 166 MHz, SPI-DDR also provides an acceptable stall response when compared with both Burst and Page NOR.
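A simplified way to relate miss rate, miss penalty, and CPU frequency is sketched below. It is a first-order model under stated assumptions and does not attempt to reproduce Table 1, whose values may reflect effects not modeled here. The 4-in-200 miss rate comes from the text; the 200 ns penalty is an illustrative value and `average_stall_cycles` is a hypothetical helper.

```python
def average_stall_cycles(misses, instructions, stall_ns_per_miss, cpu_mhz):
    """Average CPU stall cycles per instruction for a fixed miss penalty."""
    miss_rate = misses / instructions
    stall_cycles_per_miss = stall_ns_per_miss * cpu_mhz / 1000.0
    return miss_rate * stall_cycles_per_miss


miss_rate = 4 / 200  # the 2% rate used in the text

# Illustrative penalty of 200 ns per miss at two CPU clock frequencies.
at_100_mhz = average_stall_cycles(4, 200, 200, 100)
at_166_mhz = average_stall_cycles(4, 200, 200, 166)

print(f"miss rate: {miss_rate:.0%}")
print(f"100 MHz CPU: {at_100_mhz:.3f} stall cycles per instruction")
print(f"166 MHz CPU: {at_166_mhz:.3f} stall cycles per instruction")
```

The same delay in nanoseconds costs more cycles on a faster CPU, which is why the impact of stall time depends on the CPU clock frequency.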

Figure 6 shows the overall effect of these factors on IPC for a system with a CPU operating at 166 MHz and a 100 MHz memory bus. To put these figures in perspective, a typical mobile or embedded system has a cache miss rate of less than 1%. In general, SPI-DDR performance compares favorably to both Async and Page Mode NOR products. For systems with a cache miss rate of 0.5%, both Burst NOR and SPI-DDR NOR have a minimal impact on IPC of 1 to 2%. For systems with a higher cache miss rate of 1%, Burst NOR has the advantage, reducing IPC by 6% compared to 12% for SPI-DDR NOR. Burst NOR will remain the preferred solution in systems that require the highest performance, but where slightly lower performance can be tolerated, SPI-DDR provides a competitive, low-pin-count alternative.

Designing an efficient memory subsystem for mobile and embedded systems requires developers to consider many system factors beyond memory bus read bandwidth (see Table 2). For applications that copy program code into RAM for execution, sustained read performance determines system responsiveness, and systems currently based on Parallel NOR might consider SPI-DDR to reduce pin count while improving both code shadowing during boot and demand paging during normal operation.
For XiP-based applications, where memory performance and cache miss rate influence the IPC, factors such as read bandwidth, controller latency, and instant and average stall time for cache misses determine the overall efficiency of the implementation. For example, 166 MHz systems can often migrate from Async/Page NOR, with the associated high pin counts, to SPI-DDR NOR without significantly impacting bandwidth, latency, or overall system performance. When weighing SPI-DDR as a replacement for Burst NOR, a system developer must decide whether the additional pins required for the burst interface are an acceptable price to pay for the improved performance.
SPI is also a flexible technology that can adapt to changing application needs, and the slightly longer initial access time of SPI-DDR NOR is not generally a limiting factor. Broad chipset support and lower operating voltages will lead to support for higher clock rates and greater bandwidth for SPI-DDR-based NOR products, ensuring that developers will be able to achieve small end-product form factors, lower power consumption, and reduced system cost.

About the Authors
Qamrul Hasan is a Principal Member of Technical Staff in the System Solution Engineering Division at Spansion Inc. He works as a system architect with a special focus on performance modeling of hardware components and next-generation memory systems for embedded and mobile applications. Qamrul has collaborated with JEDEC standardization working groups and provided performance simulation results leading to the protocol specifications of LPDDR2-NVM and Universal Flash Storage (UFS). He holds an MSEE from Oklahoma State University, Stillwater, Oklahoma.
Cliff Zitlaw has 28 years of experience in the non-volatile memory industry. He has authored several articles and is the inventor or co-inventor of more than 20 patents related to memory architectures. He has previously served as the JEDEC Chair of JC42.2 covering low power PSRAM devices and is currently Spansion’s representative on JEDEC’s Board of Directors. Cliff has been with Spansion for four years and is currently a Spansion Fellow; prior to joining Spansion he held technical positions at Xicor, Tunitas Microsystems and Micron.
--------------------------------------
If you liked this article...
- Head to the Memory Designline homepage for the latest updates in memory and storage.
- Sign up for the Memory Designline Newsletter, 2X a month delivered to your mailbox with the latest highlights from the site.