Improved memory throughput using serial NOR flash - part 2
Cliff Zitlaw, Spansion Inc.
6/18/2012 12:51 PM EDT
Editor’s note: In part two of this story, the author investigates hardware and SPI protocol changes that increase serial NOR flash memory throughput. Part one reviewed system-level and memory-device strategies that allow higher read throughput for economical NOR flash memory. This work was first presented at the Embedded Systems Conference (ESC) 2012. You can register to view other papers from the proceedings here. For more information about ESC 2013 (San Jose, CA; April 22-25), click here.
Protocol improvements
Significant effort has also gone into minimizing protocol overhead on the SPI bus. One advantage of parallel NOR buses is that the transaction to be performed can be identified immediately. Serial devices incur additional latency because the command and the target address are presented in a serial (or “semi-serial”) fashion, so several clock cycles are needed just to describe the operation.
In the multi-IO versions of the SPI protocol, the target address, and sometimes the command, are presented several bits at a time to reduce the number of clock cycles required to initiate an operation. Other strategies for reducing command, address, and data transfer overhead are described below.
Burst types
Continuous burst read mode has been available since the first SPI-EEPROM devices were introduced. A read burst starts at the target address, and data continues to be clocked out of the device from sequential addresses (see figure 8).

Figure 8: Continuous (non-wrapped) burst of eight
More recently, SoC products have become able to execute code directly out of SPI flash memories. For cached systems, many SPI-NOR devices offer a wrapped read burst to fill a cache line efficiently. In a wrapped burst, the critical data at the target address is output first, followed by subsequent data values up to the last address within the burst length. The output then wraps back to the first address in the burst length and continues until all data within the burst length has been output (see figure 9). The burst length is often chosen to match the size of the target cache line. Wrapped bursts are commonly available in 8B, 16B, 32B, and 64B lengths.

Figure 9: Wrapped burst of eight
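The output order of a wrapped burst can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes the wrap group is naturally aligned to the burst length, which the article does not state explicitly. A continuous burst is included for comparison.

```python
def wrapped_burst_addresses(target, burst_len):
    """Byte addresses in the order a wrapped burst would output them.

    Assumes the wrap group is aligned to burst_len (a power of two).
    """
    if burst_len <= 0 or burst_len & (burst_len - 1):
        raise ValueError("burst_len must be a power of two")
    base = target & ~(burst_len - 1)    # first address of the aligned group
    offset = target & (burst_len - 1)   # position of the critical byte
    return [base + ((offset + i) % burst_len) for i in range(burst_len)]


def continuous_burst_addresses(target, count):
    """Byte addresses for a continuous (non-wrapped) burst."""
    return [target + i for i in range(count)]


if __name__ == "__main__":
    print([hex(a) for a in wrapped_burst_addresses(0x105, 8)])
    print([hex(a) for a in continuous_burst_addresses(0x105, 8)])
```

With a target of 0x105 and a burst length of eight, the wrapped burst delivers 0x105 through 0x107 first and then wraps to 0x100 through 0x104, so the cache receives the byte it is waiting on without any delay for the rest of the line.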
Protocol minimization during bus transactions
The basic SPI bus transaction protocol has also been optimized. Reducing the number of clock cycles used to specify the bus transaction minimizes the inherent latency disadvantage of a serial interface. To maintain backwards compatibility with the x1 interface, the legacy Quad Output Read sequence sends the command and address information to the device using only the IO0 signal, as shown in figure 10. Once the memory device has received the Quad Output Read command and target address, a number of latency clocks are required to retrieve the target data from the array. Once this initial latency has been satisfied, the target data is output a nibble at a time on IO0-IO3.

Figure 10: Quad Output Read (32 clocks for command/address)
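To see how much of a short read is spent on command and address, the phases of a legacy Quad Output Read can be tallied as below. This is a rough sketch, not a model of any particular device: the function name is hypothetical, the 24-bit address is inferred from the 32-clock total, and the default of eight latency clocks is a placeholder, since the real value depends on the device and clock frequency.

```python
CMD_BITS = 8    # one command byte
ADDR_BITS = 24  # three address bytes, inferred from the 32-clock total


def quad_output_read_clocks(n_bytes, latency_clocks=8):
    """Clock budget for one legacy Quad Output Read of n_bytes.

    latency_clocks is a placeholder; the real value is device- and
    frequency-dependent.
    """
    budget = {
        "command": CMD_BITS,        # x1: one bit per clock on IO0
        "address": ADDR_BITS,       # x1: one bit per clock on IO0
        "latency": latency_clocks,  # array access time
        "data": n_bytes * 8 // 4,   # x4: four bits per clock on IO0-IO3
    }
    budget["total"] = sum(budget.values())
    return budget


if __name__ == "__main__":
    print(quad_output_read_clocks(16))
```

Under these assumptions, a 16-byte read spends as many clocks on command and address (32) as on the data itself (32), which is why the remaining optimizations target the front of the transaction.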
The Quad IO Read command sequence takes advantage of the four IOs to transfer the target address from the host to the memory a nibble at a time. Once the command and address have been specified, the memory retrieves the target data from the array and outputs it a nibble at a time. Transferring the target address a nibble at a time reduces the number of clocks required to specify the command and target address from 32 to 14: eight clocks for the command plus six for the 24-bit address (see figure 11). Note that after the target address has been transferred, two further clock cycles are used to specify eight mode bits.

Figure 11: Quad IO Read (14 clocks for command/address)
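The clock-by-clock contents of a Quad IO Read header can be sketched as follows. The helper names are hypothetical, and the sketch assumes most-significant-bit-first ordering and a 24-bit address. The opcode 0xEB is the value commonly used for Quad IO Read, but it is an assumption here and should be checked against the data sheet of the device in question.

```python
def to_nibbles(value, bits):
    """Split value into 4-bit groups, most significant nibble first."""
    if bits % 4:
        raise ValueError("bits must be a multiple of 4")
    return [(value >> shift) & 0xF for shift in range(bits - 4, -1, -4)]


def quad_io_read_frame(command, address, mode):
    """One entry per clock: (bus width used, value driven on that clock)."""
    frame = []
    # Command: eight clocks, one bit per clock on IO0 only.
    for bit in range(7, -1, -1):
        frame.append(("x1", (command >> bit) & 1))
    # Address: six clocks, one nibble per clock on IO0-IO3.
    for nibble in to_nibbles(address, 24):
        frame.append(("x4", nibble))
    # Mode bits: two clocks, one nibble per clock on IO0-IO3.
    for nibble in to_nibbles(mode, 8):
        frame.append(("x4", nibble))
    return frame


if __name__ == "__main__":
    frame = quad_io_read_frame(0xEB, 0x123456, 0x00)  # 0xEB: assumed opcode
    print(len(frame[:14]), "command/address clocks,", len(frame[14:]), "mode clocks")
```

The first 14 entries correspond to the command and address clocks counted in figure 11, and the last two carry the mode bits.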
On many recent products, the eight mode bits indicate whether the next bus transaction can be assumed to be the same as the current one. Specifying the next transaction implicitly eliminates the eight-bit command portion of a read transaction (see figure 12). When the system wants to switch back to the explicit command protocol, it sets the mode bits to a value that signals an exit from implied command mode. Removing these eight clock cycles brings the number of cycles required to fully describe a read operation down to six, compared with 32 for the legacy serial command and address sequence (of which 24 carry the address).

Figure 12: Quad IO Read with implied read command (6 clocks for command/address)
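The effect of the mode bits across a series of reads can be sketched as a small state machine on the host side. The function name and the boolean flag are hypothetical simplifications: real devices encode "stay in implied command mode" as a specific mode-bit pattern defined in the data sheet, which is not modeled here.

```python
CMD_CLOCKS = 8   # command on IO0, one bit per clock
ADDR_CLOCKS = 6  # 24-bit address, four bits per clock


def command_address_clocks(keep_implied_flags):
    """Command/address clocks spent by each read in a sequence.

    keep_implied_flags[i] is True when the mode bits sent with read i tell
    the device to treat the next transaction as another read.
    """
    implied = False  # the first read always needs an explicit command
    clocks = []
    for keep in keep_implied_flags:
        clocks.append(ADDR_CLOCKS if implied else CMD_CLOCKS + ADDR_CLOCKS)
        implied = keep  # the mode bits of this read govern the next one
    return clocks


if __name__ == "__main__":
    # Three reads that stay in implied mode, then one that exits it.
    print(command_address_clocks([True, True, True, False, False]))
```

In this sketch, only the first read and the read that follows an exit from implied command mode pay the full 14 clocks; every read in between needs six.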

