Design Article

Looking for new SRAM options in embedded ASIC and SOC designs

Cyrus Afghahi and Farzad Zarrinfar, Novelics

5/9/2007 12:15 AM EDT

Static RAM memory blocks based on traditional six-transistor (6T) storage cells have been the workhorse of developers of the ASIC/SoC implementations used in many embedded designs, since such memory structures typically fit right into the mainstream CMOS process flow and don't require any additional process steps.

As shown in Figure 1a below, the basic cross-coupled latch and active load elements form the 6T memory cell and that cell can be used in memory arrays ranging in capacity from a few bits to multiple megabits.

The memory arrays can be designed to meet many different performance requirements depending on whether the designer opts to use a CMOS process optimized for high performance or low power. High-performance processes can yield SRAM blocks that have access times well below 5 ns in a 130 nm process, while low-power processes typically yield memory blocks that offer access times of 10 ns or slower.

Figure 1a : Typical six-transistor static RAM memory cell

The static nature of the memory cell keeps support circuitry to a minimum; only address and enable signals are needed to drive the decoder, sensing, and timing circuitry.

As feature sizes shrink with each more advanced process node, static RAMs built using the traditional six-transistor memory cells can deliver shorter access times and smaller cell sizes.

However, as feature sizes shrink, leakage currents and sensitivity to soft errors also increase, and designers may have to add circuitry to reduce leakage and to provide error checking and correction that can "scrub" the memory for soft errors.

Limitations of current 6T SoC RAM cells
However, the large size of the 6T cell due to the six transistors used to form the latch and high-impedance loads may limit the number of bits that can be economically implemented in the memory array.

That limitation is mostly due to the area consumed by the memory block and to the cell leakage at the technology process node (130, 90, 65 nm) used to implement the chip design. As the total area of the memory arrays grows as a percentage of the overall chip area, the size, and thus the cost, of the chip increases as well.

The leakage current may also exceed the total power budget or limit the use of 6T cells in portable devices. A larger or higher-leakage chip may not meet the targeted price point for the application and thus may not be an economical solution.

Figure 1b: Typical single-transistor/single-capacitor dynamic memory storage cell.

1T alternatives to 6T RAM cells
There is an alternative for applications that require large amounts of on-chip storage (typically more than 256 kbits) but don't require the absolute fastest access time. The solution consists of memory arrays that work like SRAMs but are based on a one-transistor/one-capacitor (1-T) memory cell such as that used in dynamic RAMs (Figure 1b above).

Such memory arrays can deliver two to three times the density in the same chip area as a 6T-based memory array. Simple dynamic RAM arrays can be used when embedded memory requirements exceed several megabits, but such arrays require that the system controller and logic be aware of the dynamic nature of the memory and take an active role in providing the refresh control and timing signals.

The alternative to embedding a simple DRAM memory block is to wrap the DRAM array with its own controller to make it appear like the simple-to-use SRAM array. By combining the high-density 1-T storage cells with some support logic that provides the refresh signals, the dynamic nature of the memory cells is hidden to the ASIC/SoC designer, and designers can treat the memory block as if it were a static RAM when implementing their ASIC and SoC solutions (Figure 2 below).
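As a rough illustration of the wrapper idea, the Python sketch below models a DRAM-style array whose rows lose their data unless refreshed within a retention window, together with an internal controller that refreshes one row per clock tick in round-robin order. The class name, the tick-based timing, and the auto_refresh switch are all hypothetical and are not taken from any vendor's design. A real controller must also arbitrate between refresh and user accesses, which this toy model ignores.

```python
class HiddenRefreshRam:
    """Toy behavioral model of a DRAM-style array behind an SRAM-style interface."""

    def __init__(self, rows, retention_ticks, auto_refresh=True):
        self.rows = rows
        self.retention_ticks = retention_ticks  # ticks a row survives without refresh (hypothetical)
        self.auto_refresh = auto_refresh        # False models a bare array with no controller
        self.data = [0] * rows
        self.age = [0] * rows                   # ticks since last write or refresh, per row
        self.next_refresh_row = 0

    def _tick(self):
        # Every row ages by one tick; a row older than the retention window loses its data.
        for r in range(self.rows):
            self.age[r] += 1
            if self.age[r] > self.retention_ticks:
                self.data[r] = None
        # The internal controller refreshes one row per tick, round-robin.
        if self.auto_refresh:
            r = self.next_refresh_row
            if self.data[r] is not None:
                self.age[r] = 0
            self.next_refresh_row = (r + 1) % self.rows

    def write(self, row, value):
        self._tick()
        self.data[row] = value
        self.age[row] = 0

    def read(self, row):
        self._tick()
        return self.data[row]

    def idle(self, ticks):
        for _ in range(ticks):
            self._tick()


if __name__ == "__main__":
    wrapped = HiddenRefreshRam(rows=8, retention_ticks=8)
    wrapped.write(3, 0xA5)
    wrapped.idle(100)
    print("with hidden refresh:", wrapped.read(3))

    bare = HiddenRefreshRam(rows=8, retention_ticks=8, auto_refresh=False)
    bare.write(3, 0xA5)
    bare.idle(100)
    print("without refresh:", bare.read(3))
```

With the refresh controller enabled and a retention window at least as long as the number of rows, a written value survives any amount of idle time, so the caller sees SRAM-like behavior. With refresh disabled, the same value is lost.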

Some companies and foundries have developed 1-T cells that require mask layers beyond the standard CMOS layers. Such an approach increases the wafer cost and ties fabrication to a specific foundry. To justify the extra wafer processing cost, the total DRAM array size used in a chip must typically be more than 50% of the die area. Also, most of the offered DRAM macros are hard macros with limited size, aspect ratio and interfaces.

Figure 2: The addition of control and interface support logic around a DRAM memory array makes the array appear to operate like a static RAM, thus delivering improved memory density

What's required for SoC design is a more cost-effective IP macro that can easily be processed in any fab or transferred from one fab to another for cost or capacity reasons. That macro should also offer more flexibility to the ASIC designer when it comes to layout and configuration.

Such an approach, called 'one transistor SRAM', is available for several foundries as licensable intellectual property. One such compiler-driven method is available in bulk CMOS with no additional mask steps for 15%-20% lower wafer cost and faster time to market.

The resulting memory block interface looks just like a static RAM to the rest of the system, but achieves about two to three times the density (bits/unit area) vs memory arrays based on the 6-T cells (after averaging in the support circuitry overhead as part of the area calculation). The larger the memory array, the less the overall area required by the support circuitry and more area-efficient the memory block will be.
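The effect of array size on area efficiency can be sketched with a deliberately simple model in which the macro area is the cell area times the number of bits plus a fixed block of support circuitry. The function names and every numeric value below are placeholders chosen only to show the trend; they are not measured data for any product, and real support circuitry (decoders, sense amplifiers) also grows with the number of rows and columns.

```python
def macro_area_um2(bits, cell_area_um2, fixed_overhead_um2):
    """Total macro area: storage cells plus a fixed block of support circuitry."""
    return bits * cell_area_um2 + fixed_overhead_um2


def overhead_fraction(bits, cell_area_um2, fixed_overhead_um2):
    """Share of the macro area spent on support circuitry rather than cells."""
    return fixed_overhead_um2 / macro_area_um2(bits, cell_area_um2, fixed_overhead_um2)


if __name__ == "__main__":
    # Placeholder values, assumed for illustration only.
    CELL_AREA_UM2 = 0.5
    FIXED_OVERHEAD_UM2 = 100_000.0
    for kbits in (64, 256, 1024, 4096):
        bits = kbits * 1024
        frac = overhead_fraction(bits, CELL_AREA_UM2, FIXED_OVERHEAD_UM2)
        print(f"{kbits:5d} kbit: overhead = {frac:.1%} of macro area")
```

Under this model the overhead share falls steadily as the array grows, which is the trend described above.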

To create the desired memory array, designers can use memory compiler tools such as MemQuest to configure cooler, faster, or denser coolSRAM-1T instances that are portable across foundries and technology nodes (Figure 3, below), thus avoiding non-recurring engineering fees for manual array implementation.

The compiler also enables customers to use the optimum core size, interface and aspect ratio with the shortest time to market, and provides designers with electrical, physical, simulation (Verilog and VHDL), test, and synthesis views of the memory array it compiles.

Figure 3: The portable coolSRAM-1T was designed for extremely low-power operation through the use of adaptive circuit sizing, virtual grounding, adaptive back biasing, and other circuit techniques to lower leakage current. Furthermore, in the coolSRAM-1T cell structure, attention has been paid to minimizing junction and sub-threshold leakage current.

In a 1-Mbit memory array instantiation, a coolSRAM-1T configuration, for example, has a leakage current of a few microamps at room temperature and typical corner specs for supply voltage and clock rate (Figure 3, above).

At a typical refresh rate of 100 kHz or less with a 128 kword by 8-bit organization, the 1 Mbit coolSRAM-1T array has an idle power with data retention comparable to that of a similar-capacity SRAM. (A 1-Mbit instance of the coolSRAM-6T occupies an area of about 2.6 square millimeters and consumes less than 100 microwatts per Megahertz when the memory block is fabricated in a 130 nm G process from TSMC.)

Although the SRAM-1T functions like an SRAM, it does have DRAM characteristics on the inside—at room temperature when implemented in a 130 nm process, the memory cell can retain data for tens of milliseconds. The supporting refresh control logic transparently provides the refresh and will adjust the refresh period based on the temperature.
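The link between retention time and refresh rate is simple arithmetic: every row must be revisited once per retention window, so the required row-refresh rate is the number of rows divided by the retention time. The sketch below works this out for a 1-Mbit array. The 1024-bit row and the 10 ms and 20 ms retention values are assumptions picked to match "tens of milliseconds", not figures for a specific product.

```python
def rows_in_array(total_bits, bits_per_row):
    """Number of rows when the array divides evenly into rows."""
    if total_bits % bits_per_row != 0:
        raise ValueError("total_bits must be a multiple of bits_per_row")
    return total_bits // bits_per_row


def min_row_refresh_rate_hz(total_bits, bits_per_row, retention_ms):
    """Row refreshes per second needed to revisit every row within retention_ms."""
    return rows_in_array(total_bits, bits_per_row) * 1000 / retention_ms


if __name__ == "__main__":
    ONE_MBIT = 2 ** 20
    for retention_ms in (10, 20):  # assumed retention times
        rate = min_row_refresh_rate_hz(ONE_MBIT, 1024, retention_ms)
        print(f"retention {retention_ms} ms -> {rate:.0f} row refreshes per second")
```

Under these assumptions the array needs roughly 50,000 to 100,000 row refreshes per second, which is the same order of magnitude as the 100 kHz refresh rate quoted earlier, although the article does not state how that figure was derived. A longer retention time at lower temperature lowers the required rate in direct proportion.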

Designers can also opt to bypass the refresh controller in the memory array and provide their own refresh signals from the SoC logic if they want the SoC to manage the refresh. This can potentially save some dynamic power on the SoC since the system logic can operate on an "as-needed" basis rather than on an "automatic" basis for the SRAM-1T's embedded refresh logic.

The memory cells in the SRAM-1T instance also support sleep and standby modes. During sleep mode, the clock to a large percentage of the memory array is suppressed to drastically reduce power consumption.

When the array is "awakened," data must be reloaded into the memory cells. During the standby mode, the memory retains data by using a low-frequency refresh operation that dissipates minimal power. When brought back to active mode, the memory is ready for use; data does not have to be reloaded into the memory array.

Designers can also configure the memory array to refresh in various row sizes - 256, 512, 1024, or 2048 bits, or even refresh multiple rows in parallel. This allows the designer to provide selective refresh to only a portion of the array to keep critical data "alive" while powering down the rest of the array.
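To see how much refresh activity selective refresh can save, the sketch below finds which rows hold a block of critical data and what fraction of the full-array refresh work those rows represent. It assumes a simple linear mapping of bit addresses to rows; the function names and the 64-kbit critical region are hypothetical.

```python
def rows_covering(start_bit, end_bit, bits_per_row):
    """Rows that hold any bit in the half-open range [start_bit, end_bit)."""
    if not 0 <= start_bit < end_bit:
        raise ValueError("need 0 <= start_bit < end_bit")
    first_row = start_bit // bits_per_row
    last_row = (end_bit - 1) // bits_per_row
    return range(first_row, last_row + 1)


def refresh_activity_fraction(start_bit, end_bit, bits_per_row, total_bits):
    """Fraction of full-array refresh work needed to keep only the given range alive."""
    kept_rows = len(rows_covering(start_bit, end_bit, bits_per_row))
    return kept_rows / (total_bits // bits_per_row)


if __name__ == "__main__":
    ONE_MBIT = 2 ** 20
    # Keep the first 64 kbit alive in a 1-Mbit array with 1024-bit rows (assumed values).
    frac = refresh_activity_fraction(0, 64 * 1024, 1024, ONE_MBIT)
    print(f"rows kept alive need {frac:.2%} of the full refresh activity")
```

A region that is not aligned to row boundaries still costs whole rows, so placing critical data on row boundaries keeps the refreshed portion as small as possible.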

With any memory array there is always the chance that manufacturing variations will result in a bad bit or two in the memory array. Rather than discard the chip, our designers added both column and row redundancy schemes to enhance yield.

A built-in self-repair capability, used in conjunction with one time programmable coolOTP memory, can be employed to repair the memory array if bits fail once the chip has been shipped. Optionally available is a built-in self-test capability that can be added to the memory IP block with no performance degradation.
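Row redundancy can be pictured as a small lookup table that redirects a failing row to a spare one. The sketch below is a generic illustration of that idea, not a description of the coolSRAM-1T repair logic; the class name, the placement of spare rows after the regular rows, and the first-come allocation policy are all assumptions.

```python
class RowRepairMap:
    """Redirects accesses from known-bad rows to spare rows."""

    def __init__(self, num_rows, num_spares):
        self.num_rows = num_rows
        # Assumption: spare rows sit physically after the regular rows.
        self.free_spares = list(range(num_rows, num_rows + num_spares))
        self.remap = {}  # bad logical row -> physical spare row

    def repair(self, bad_row):
        """Assign a spare to bad_row; return False when no spare is left."""
        if bad_row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[bad_row] = self.free_spares.pop(0)
        return True

    def physical_row(self, logical_row):
        """Row actually accessed for a given logical row address."""
        return self.remap.get(logical_row, logical_row)


if __name__ == "__main__":
    repair_map = RowRepairMap(num_rows=1024, num_spares=2)
    repair_map.repair(17)
    print(repair_map.physical_row(17), repair_map.physical_row(18))
```

In hardware the table contents would be held in nonvolatile storage, such as the one-time-programmable memory mentioned above, so that the repair persists across power cycles.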

Figure 4: In a typical SoC design, wide internal memory buses can be used to rapidly transfer time-critical data for graphics and DSP operations.

When the basic performance of the memory array doesn't meet the system needs, there are some architectural techniques designers can use to achieve higher performance from the memory array. However, these techniques come at a price: they affect chip power, size and complexity, so a careful tradeoff analysis must be done to determine the optimum combination of memory array and chip architecture to achieve the desired performance and cost goals.

One available technique for chip architects would be to use a wide-word architecture that might have the memory organized to deliver a 128, 256, or even 1024-bit-wide data word internally and then multiplexed down to the desired word size (Figure 4 above).

This technique can double or quadruple the apparent clock rate, thus reducing the effective access time and substantially reducing the power consumption. The penalty in this case might be the area impact on the IP design due to the multiplexing logic needed to reduce the wide word down to the appropriate-sized words for the rest of the SoC to use.
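The multiplexing step can be sketched in a few lines: one wide word is fetched from the array in a single access and then handed to the rest of the chip as several narrower words. The function name and the choice to deliver the least-significant word first are assumptions made for illustration.

```python
def split_wide_word(wide_word, wide_bits, narrow_bits):
    """Split one wide memory word into narrow words, least-significant word first."""
    if wide_bits % narrow_bits != 0:
        raise ValueError("wide_bits must be a multiple of narrow_bits")
    mask = (1 << narrow_bits) - 1
    return [(wide_word >> (i * narrow_bits)) & mask
            for i in range(wide_bits // narrow_bits)]


if __name__ == "__main__":
    # One 128-bit array access serves four 32-bit words.
    wide = 0x00000004_00000003_00000002_00000001
    print(split_wide_word(wide, 128, 32))
```

Here a single 128-bit access supplies four 32-bit words, so the array itself is cycled only once for every four words delivered.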

Figure 5a: Multiple memory instances (banks) can be interleaved by adding some additional control and timing circuits to double, triple, quadruple, etc. (depending on the number of banks) the data rate to the host processor.

Another option would be to split the memory into multiple instances (banks) and set up a memory controller to alternately access the instances in consecutive cycles so that some of the access time is hidden by switching between the banks (Figure 5a above).

In a non-interleaved system, the memory subsystem must operate at the system clock speed, and that may slow down the system if the memory accesses can't keep pace with the clock (Figure 5b below).

Figure 5b: In non-interleaved systems, the memory-bank access time limits the system clock speed when accessing the memory array.

In the interleaved memory approach, however, the clock frequency can be doubled, tripled, quadrupled, etc., depending on the number of banks. System complexity, though, increases considerably when more than two banks are interleaved.

In the case of a dual-bank system, the clock frequency can be double the maximum speed that each memory bank can handle, but since each instance is cycled at half of the clock frequency, the individual bank doesn't see the change in clock speed (Figure 5c below).

Figure 5c: In an interleaved multibank system, the clock can run at a multiple (clock x number of banks) of the non-interleaved clock rate.

Rather, some global logic surrounding the memory banks runs at double the memory speed and steers the address information to each of the two banks on alternate clock cycles. This global logic can be shared among the multiple banks, thus saving area and power.

Additional logic at the data input/output port multiplexes or demultiplexes the data, either delivering data to the host system at double the data rate or delivering data to the banks at half the incoming rate. The effective throughput of the memory subsystem is doubled, yet the active power is lower than that of a single block with twice the storage capacity.

Although this approach could cut the access time by close to 50%, it does come with the cost penalty of additional support circuitry and design/timing complexity. In this approach, the data access from the memory is typically delayed by one cycle (single-cycle latency access) and the access is quasi-random: the system cannot access the same internal bank every cycle.
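A small timing model makes the quasi-random restriction concrete. In the sketch below, one access can be issued per system clock cycle, but a bank that has just been accessed stays busy for a fixed number of cycles. Selecting the bank from the low-order address bits is an assumption, as are the function and parameter names.

```python
def simulate_interleaved(addresses, num_banks, bank_busy_cycles):
    """Return the system clock cycle at which each access can be issued.

    Assumption: the bank is chosen by the low-order address bits, and a bank
    cannot start a new access until bank_busy_cycles after its previous one.
    """
    bank_free_at = [0] * num_banks
    issue_cycles = []
    cycle = 0
    for addr in addresses:
        bank = addr % num_banks
        cycle = max(cycle, bank_free_at[bank])  # stall while the bank is busy
        issue_cycles.append(cycle)
        bank_free_at[bank] = cycle + bank_busy_cycles
        cycle += 1
    return issue_cycles


if __name__ == "__main__":
    # Two banks, each needing two system clock cycles per access.
    print("alternating banks:", simulate_interleaved(list(range(8)), 2, 2))
    print("same bank only:   ", simulate_interleaved([0, 2, 4, 6], 2, 2))
```

Sequential addresses alternate between the two banks and issue on every cycle, while a stream that keeps hitting the same bank issues only on every second cycle, which is the single-bank rate.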

Cyrus Afghahi, PhD is CEO and co-founder, and Farzad Zarrinfar is president of Novelics. Prior to co-founding Novelics in 2005, Dr. Afghahi was Technical Director for the Office of the CTO at Broadcom Corporation. Previous to Novelics, Zarrinfar was Vice President of Worldwide Sales for Strategic Accounts at ARC International and is a board member for the GSPX DSP conference.

