
Design Article


Understanding the effects of power failure on flash-based SSDs

Hung-Wei Tseng, Laura M. Grupp, and Steven Swanson, University of California, San Diego

2/27/2012 12:38 PM EST

As flash-based solid-state drives (SSDs) become more popular in all kinds of computing devices, the integrity of flash memory when power fails becomes increasingly important. Power failure is potentially much more dangerous for SSDs than it is for conventional hard drives. SSDs use complex flash translation layers (FTLs) to manage the mapping between logical block addresses and physical flash memory locations, so if a power failure corrupts the metadata that describes this mapping, the entire SSD can become inoperable. To design products that can withstand power failures and the resulting data corruption, system designers must understand what kinds of corruption a power failure can cause.

At the Non-Volatile Systems Laboratory (NVSL) at the University of California, San Diego, we are working to understand the impact of power failure on flash devices so that designers can build more reliable SSDs. In this work, we built a test platform that allows applications to cut off the power supply at precisely controlled times. For our experiments, we selected 11 chips that cover a variety of technologies and capacities. To test the impact of power failure during program and erase operations, we cut power to the flash chip at different points during each operation.

In our experiments, we found unexpected behaviors for both program and erase operations in the presence of power failure.

First, increasing the time before power failure does not always reduce error rates. Intuitively, the more time we give the flash chip to perform an operation before power failure, the fewer errors there should be; however, our results indicate that this is not always the case. For both program and erase operations, the bit error rate may remain constant or spike briefly as we give the chip more time to perform the operation before power failure. Figure 1 shows an example of this phenomenon. The graph contains many plateaus where the bit error rate remains constant and spikes where it rises sharply for a short interval.



Figure 1: The bit error rate of program operations with different power cut off intervals (the time we give the chip to perform an operation) for a multi-level cell (MLC) chip.
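The bit error rate discussed above can be computed by comparing the data that was meant to be written with the data read back after power is restored. The following is a minimal sketch of that calculation, not the measurement code used in the experiments; the function names are hypothetical, and the monotonicity check simply encodes the intuition that more time should never mean more errors.

```python
from typing import Sequence


def count_bit_errors(expected: bytes, observed: bytes) -> int:
    """Count the bit positions at which the two buffers differ."""
    if len(expected) != len(observed):
        raise ValueError("buffers must have the same length")
    # XOR leaves a 1 in every position where the two bytes disagree.
    return sum(bin(e ^ o).count("1") for e, o in zip(expected, observed))


def bit_error_rate(expected: bytes, observed: bytes) -> float:
    """Fraction of bits that differ between intended and read-back data."""
    total_bits = 8 * len(expected)
    if total_bits == 0:
        raise ValueError("buffers must not be empty")
    return count_bit_errors(expected, observed) / total_bits


def is_non_increasing(rates: Sequence[float]) -> bool:
    """True if each error rate is no higher than the one before it.

    The rates are assumed to be ordered by increasing power cut-off interval.
    """
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))
```

Applied to a series of measurements ordered by cut-off interval, `is_non_increasing` would return `False` for a curve with spikes like the ones described for Figure 1.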

Second, a power failure during a program operation can corrupt data that a previous, successful program operation wrote to a multi-level cell (MLC) chip. We call this retroactive data corruption. Each MLC cell contains two bits of data, and each bit belongs to a different logical NAND page. If power failure occurs while the later-programmed page (the second page) is being programmed, data in the previously programmed page that shares the same cells (the first page) can become unreliable. For example, we programmed the first pages with 1’s without power failure. When power failed between 500 μs and 900 μs into the later program operations on the second pages, those operations corrupted the data in the first pages and left both the first-page and second-page bits as 0’s (see Figure 2). In our experiments that program random data into flash chips, retroactive data corruption can produce a bit error rate as high as 25% in the first pages. This effect poses a serious threat to SSD reliability because it invalidates the assumption that once a program operation completes, the data will remain intact regardless of any future failures.


Figure 2: The cell state distribution for an MLC chip when we program the second page from 1 to 0 given the first page is programmed as 1.
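To make the page-sharing relationship concrete, the sketch below models each MLC cell as a pair of bits, one belonging to the first page and one to the second. It is a toy model only: the helper names are hypothetical, the example cell states are made up for illustration, and it says nothing about the voltage-level behavior inside a real chip.

```python
from typing import List, Tuple

# One MLC cell holds two bits: (first_page_bit, second_page_bit).
Cell = Tuple[int, int]


def combine_pages(first_page: List[int], second_page: List[int]) -> List[Cell]:
    """Pair up the bits of two logical pages that share the same cells."""
    if len(first_page) != len(second_page):
        raise ValueError("pages must have the same number of bits")
    return list(zip(first_page, second_page))


def first_page_error_rate(intended_first: List[int], cells: List[Cell]) -> float:
    """Fraction of first-page bits that no longer match what was programmed."""
    if not cells or len(intended_first) != len(cells):
        raise ValueError("need one intended bit per cell, and at least one cell")
    errors = sum(
        1 for want, (got, _second) in zip(intended_first, cells) if want != got
    )
    return errors / len(cells)


# First page programmed successfully with 1's; second page still unprogrammed.
intended_first = [1, 1, 1, 1]
before_failure = combine_pages(intended_first, [1, 1, 1, 1])

# Hypothetical cell states after an interrupted second-page program:
# two cells ended up with both bits at 0, so their first-page bit is lost.
after_failure: List[Cell] = [(0, 0), (0, 0), (1, 1), (1, 0)]
```

In this made-up example, `first_page_error_rate(intended_first, after_failure)` reports that half of the first-page bits are wrong even though the first page was never written again.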

Third, interrupted program operations leave data more susceptible to read disturb and increase the probability that the programmed data will decay over time. We found that a program operation sometimes seems to complete after power failure occurs (with no errors or a very low error rate), yet many errors appear after just 1000 reads. We also found that programming data with power failure can reduce the long-term stability of the data stored in the flash chip. Figure 3 shows the data retention of blocks that we programmed under different conditions. The results suggest that, for a block that was not fully programmed, the error rate increases fourfold after the chip has aged for 10 years.


Figure 3: Baking chips to accelerate aging reveals that power failure during program operations reduces the long-term reliability of data stored in flash chips.

Finally, incomplete erase operations make future program operations to the same block unreliable. We found that a block erase operation may appear to be complete after power failure occurs. However, among the chips we tested, programming a block that was erased with power failure can result in a bit error rate as high as 0.9%.

Based on the experimental results from this project, we can suggest some methods to mitigate the effects of power loss. First, since incomplete program operations may corrupt existing data and the bit error rate does not decrease monotonically as the operation time increases, the SSD should be equipped with backup batteries and capacitors that guarantee the program operation will complete when power failure occurs. Second, the SSD can store metadata or other important data in the second pages, since retroactive data corruption never affects the data on the second page. Third, the SSD can apply special coding mechanisms to avoid the transitions that cause retroactive data corruption. Finally, when power resumes, the SSD should move the data out of any page programmed with power failure and re-erase any block erased with power failure.
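The last suggestion, cleaning up after power resumes, could be sketched as follows. This is an illustration under stated assumptions rather than a description of any real FTL: it assumes the SSD has some way to know which operations were in flight when power failed, and every name here (`Op`, `InterruptedOp`, `plan_recovery`, the action labels) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Op(Enum):
    PROGRAM = "program"
    ERASE = "erase"


@dataclass(frozen=True)
class InterruptedOp:
    """An operation assumed to have been in flight when power failed."""
    op: Op
    block: int
    page: Optional[int] = None  # only meaningful for program operations


# (action name, block number, page number or None)
Action = Tuple[str, int, Optional[int]]


def plan_recovery(interrupted: List[InterruptedOp]) -> List[Action]:
    """Turn the list of interrupted operations into recovery actions."""
    actions: List[Action] = []
    for item in interrupted:
        if item.op is Op.PROGRAM:
            # The data may look fine but be prone to read disturb and decay,
            # so it is moved out of the page instead of being trusted.
            actions.append(("relocate_page", item.block, item.page))
        else:
            # An erase that only appears complete makes later programs
            # unreliable, so the whole block is erased again.
            actions.append(("re_erase_block", item.block, None))
    return actions
```

For instance, an interrupted program to page 7 of block 3 followed by an interrupted erase of block 5 would yield one `relocate_page` action and one `re_erase_block` action, in that order.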

In addition to the power failure study presented in this article, we are currently investigating how flash behaves under power fluctuations that fall short of complete power failure. Our preliminary data suggest both that designers should take additional precautions to protect data and that there is an opportunity for management schemes that can save power without sacrificing reliability.

About the authors

Hung-Wei Tseng and Laura Grupp are graduate students and Steven Swanson is an assistant professor at the University of California, San Diego.

