Design Article
Not All MLC SSDs Are Created Equal
Scott Stetzer, STEC, Inc.
1/21/2012 3:05 PM EST
Understanding Enterprise SSD Endurance
SSD endurance is a measure of the usable life of the flash memory cells, typically specified as the number of writes a cell can sustain. ‘Writing’ to a cell requires more electrical charge than ‘reading’ it, and each cell must be erased before it can be written to again. In either case, every electrical charge passed through a NAND flash memory cell as part of a read or write operation wears that cell down.
In the enterprise, accelerated access to data is the primary reason SSDs are deployed, and since flash memory cells will be written to multiple times each day, endurance determines the reliable life of each drive. The endurance, performance, reliability and availability of MLC-based SSDs depend directly on the design of the SSD controller (not on the NAND flash memory, as many suspect). The SSD controller is the brains of the drive: it responds to host commands, transfers data between the host and flash media, and manages the flash media to achieve high reliability and endurance. How effectively the controller manages the flash memory determines whether the SSD can be used in enterprise applications that require 24/7/365 uninterrupted operation under heavy read and write workloads. The real question is whether an SSD manufacturer can guarantee up to 30 full-capacity writes per day for 5 years using MLC media, rivaling the endurance of SLC media.
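As a rough check on what such a guarantee implies, the arithmetic can be sketched as below. The 30 writes per day and 5 years come from the text above; the 200 GB capacity and the function names are assumptions chosen only for illustration.

```python
def full_drive_writes(writes_per_day: int, years: int, days_per_year: int = 365) -> int:
    """Total number of full-capacity writes over the guarantee period."""
    return writes_per_day * days_per_year * years


def total_data_written_tb(capacity_gb: int, writes_per_day: int, years: int) -> float:
    """Total host data written, in decimal terabytes (1 TB = 1000 GB)."""
    return capacity_gb * full_drive_writes(writes_per_day, years) / 1000


# Figures from the article: 30 full-capacity writes per day for 5 years.
cycles = full_drive_writes(30, 5)

# Hypothetical 200 GB drive (an assumption for illustration only).
written_tb = total_data_written_tb(200, 30, 5)

print(f"{cycles} full-capacity writes, {written_tb:.0f} TB written")
```

Under these assumptions the drive must absorb 54,750 full-capacity writes. Ignoring write amplification and over-provisioning, and assuming writes are spread perfectly evenly, that is roughly the number of write/erase cycles each cell would have to sustain.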
A Deeper Dive into Flash Media Wear
To store data in NAND flash memory, an electrical charge is placed in the ‘floating gate’ portion of the NAND cell, which either blocks or enables the flow of electricity through the gate. As the NAND cell ages (or cycles), the floating gate breaks down as electrons drop out of it or get trapped below it. Slowing this breakdown improves SSD endurance and reliability, and enabling technology is available that softens the impact that erase, write and read operations have on NAND flash memory cells. This technology is described later in the article.
To keep NAND flash degradation from adversely affecting SSD reliability, error correction code (ECC) technology is a standard feature of most enterprise-class SSDs. ECC enables the built-in SSD controller to detect and correct a limited number of bit errors in each block of data.
As the NAND wears out, it will at some point produce more bit errors than the ECC engine can correct. When this occurs, the SSD controller performs a read retry, reading the data again in the hope that this time it can be read correctly. This double layer of protection gives SSDs an exceptionally low unrecoverable bit error rate (UBER), which enables high reliability. As the NAND flash ages, however, the average number of read retries required increases, and these retries reduce read performance, and with it the overall performance of the SSD, over time. What is needed is an enabling technology that slows the ‘wear-out’ rate of the flash so that heavy reliance on ECC and read retries is avoided or significantly delayed.
In reality, the larger issue with NAND flash is the higher electrical charge used for the erase operation, and then for the write operation; these are what primarily impact endurance. To materially increase an SSD’s operating life, more advanced techniques are required.
Techniques such as over-provisioning, throttling, compression, and de-duplication are mechanisms for delaying writes to NAND flash memory and can be effective when deployed, but they do not increase the number of times the flash can be written. As such, the gains they can provide are limited. Wear-leveling, likewise, doesn’t actually increase endurance; instead, the flash controller spreads writes evenly across all blocks in the SSD so that no location wears out faster than any other over the life of the drive.
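The two-layer flow described above (ECC first, then read retry) can be sketched roughly as follows. This is a simplified model, not any vendor's controller logic: the function name, the `ecc_limit` and `max_retries` parameters, and the idea of feeding in a list of raw error counts per read attempt are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def read_with_retry(raw_error_counts: Iterable[int],
                    ecc_limit: int,
                    max_retries: int) -> Tuple[bool, int]:
    """Model one logical read of a block.

    raw_error_counts: raw bit errors seen on each successive read attempt
                      (hypothetical input standing in for the NAND).
    ecc_limit:        most bit errors the ECC engine can correct per block.
    max_retries:      how many extra read attempts the controller allows.

    Returns (recovered, retries_used).
    """
    retries = 0
    for errors in raw_error_counts:
        if errors <= ecc_limit:
            # ECC can correct this attempt, so the read succeeds.
            return True, retries
        if retries == max_retries:
            # Retry budget exhausted: the error is unrecoverable.
            break
        retries += 1
    return False, retries


# Fresh flash: few errors, corrected on the first attempt.
fresh = read_with_retry([3], ecc_limit=8, max_retries=2)

# Worn flash: first attempt exceeds the ECC limit, one retry succeeds.
worn = read_with_retry([12, 5], ecc_limit=8, max_retries=2)

# Badly worn flash: every attempt exceeds the ECC limit.
failed = read_with_retry([12, 11, 10, 9], ecc_limit=8, max_retries=2)
```

Each retry is an additional read of the flash, which is why a rising retry count shows up as lower read performance as the drive ages.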
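A minimal sketch of the wear-leveling idea, assuming the simplest possible policy (always write to the least-erased block), is shown below. Real controllers use more elaborate schemes; the class and method names here are hypothetical.

```python
class WearLeveler:
    """Toy wear-leveler: directs every write to the least-worn block."""

    def __init__(self, num_blocks: int) -> None:
        # One erase counter per NAND block, all starting at zero.
        self.erase_counts = [0] * num_blocks

    def allocate(self) -> int:
        """Pick the block with the fewest erases and charge it one cycle."""
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        self.erase_counts[block] += 1
        return block


leveler = WearLeveler(num_blocks=4)
for _ in range(10):
    leveler.allocate()

# Wear is spread evenly: no block is more than one cycle ahead of another.
spread = max(leveler.erase_counts) - min(leveler.erase_counts)

# The total number of cycles consumed is unchanged by the policy.
total = sum(leveler.erase_counts)
```

The sketch also illustrates the limitation noted above: the ten writes still consume ten erase cycles in total. Wear-leveling only changes where the cycles land, not how many the flash can sustain.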
sharps_eng
1/22/2012 3:36 PM EST
This technology illustrates the pressure to create workarounds for the recent Moore's Law crunch that means smaller geometries are not appearing fast enough to meet demand.
Previous EDC/ECC and other flash-'nursing' initiatives failed because bigger chips appeared that allowed the protection to be implemented at a higher level, in software. Hardware was only necessary for custom high-integrity applications.
STEC have a window of opportunity to make MLC work for a wider range of applications before a memory breakthrough pushes the density up again cheaply enough to compete. But is that breakthrough in sight? I personally love FRAM but can it be made dense enough? I think not.
Production will also only be available when a big fab becomes surplus to DRAM or flash requirements. No-one will build a fab for FRAM speculatively, I think.
Perhaps a slowdown will create spare FAB capacity?
DrWattsOn
2/5/2012 1:08 PM EST
Agreed about FRAM: love the idea, but I doubt density will reach levels high enough for use in computers as Storage: maybe BIOS/UEFI.
I like the materials from STEC, but not able to find any products identified as containing their technology, even searching all their links. Looks like vaporware to me.
markhahn
11/15/2012 12:04 PM EST
where did this figure of 30 full-device writes per day come from? I'm sure there's a market for that, but it has to be fairly small. obviously, most storage and computation is more consumer-like, with read-mostly loads, and often much sparser duty cycles than 24x7. it's easy to find very cheap SSDs today that peak at 500 MB/s and 80k iops and still offer 3-5 year warranties. commodity storage is cheap enough to simply use above-device redundancy to solve issues of reliability and permanence.
STEC's pitch seems to be pretty intensive engineering at the device level - laudable, but do people buy these inherently more expensive (and apparently slower) devices and trust them without any above-device redundancy (raid, etc)?