Abstract
Modern storage technology, including solid-state disks (SSDs), NoSQL databases, and commoditized RAID hardware, brings new reliability challenges to the already-complicated storage stack. Among other things, the behavior of these new components during power faults, which happen relatively frequently in data centers, is an important yet mostly ignored issue in this dependability-critical area. Understanding how new storage components behave under power faults is the first step towards designing new robust storage systems.
In this article, we propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially designed hardware to inject power faults directly into devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test 17 commodity SSDs from six different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that 14 of the 17 tested SSDs exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
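To make three of the failure types named above more concrete, the following is a minimal sketch of how a post-fault checker might distinguish bit corruption, shorn writes, and unserializable writes. It is an illustration under stated assumptions, not the article's actual framework: the self-describing sector layout (block id, sequence number, sector index, CRC32 trailer), the 512-byte sector and 8-sector block sizes, and every function name are hypothetical.

```python
import struct
import zlib

# Hypothetical layout: each 512-byte sector describes itself so that a
# checker can tell, after a power fault, which write it belongs to.
SECTOR = 512
SECTORS_PER_BLOCK = 8
HEADER = struct.Struct("<IQI")  # block id, sequence number, sector index
CRC = struct.Struct("<I")       # CRC32 over header + payload
PAYLOAD_LEN = SECTOR - HEADER.size - CRC.size


def make_sector(block_id, seq, index):
    """Build one sector: header, filler payload, CRC32 trailer."""
    body = HEADER.pack(block_id, seq, index) + bytes([seq % 256]) * PAYLOAD_LEN
    return body + CRC.pack(zlib.crc32(body))


def make_block(block_id, seq):
    """Build a block whose sectors all carry the same sequence number."""
    return b"".join(make_sector(block_id, seq, i)
                    for i in range(SECTORS_PER_BLOCK))


def classify_block(block_id, data):
    """Return (verdict, seq) for one block read back after a power fault."""
    seqs = set()
    for i in range(SECTORS_PER_BLOCK):
        sector = data[i * SECTOR:(i + 1) * SECTOR]
        body = sector[:-CRC.size]
        (stored_crc,) = CRC.unpack(sector[-CRC.size:])
        if zlib.crc32(body) != stored_crc:
            return "bit_corruption", None
        found_block, seq, index = HEADER.unpack(body[:HEADER.size])
        if found_block != block_id or index != i:
            return "misplaced_data", None
        seqs.add(seq)
    if len(seqs) > 1:
        # Every sector is valid, but they come from different writes.
        return "shorn_write", None
    return "intact", seqs.pop()


def find_unserializable(acked, recovered):
    """Flag blocks whose last acknowledged write is missing even though a
    later acknowledged write to some other block survived.

    acked:     list of (seq, block_id) in acknowledgement order
    recovered: {block_id: seq} for blocks classified as intact
    """
    last_acked = {}
    for seq, block_id in acked:
        last_acked[block_id] = seq
    newest_survivor = max(recovered.values(), default=-1)
    return sorted(b for b, seq in last_acked.items()
                  if b in recovered and recovered[b] < seq < newest_survivor)


if __name__ == "__main__":
    old, new = make_block(3, 6), make_block(3, 7)
    print(classify_block(3, new))
    print(classify_block(3, new[:3 * SECTOR] + old[3 * SECTOR:]))
    print(find_unserializable([(1, 0), (2, 1), (3, 2)], {0: 1, 1: 0, 2: 3}))
```

The per-sector checksum is what lets this sketch separate a shorn write (valid sectors from two different writes in one block) from bit corruption (a sector that fails its own checksum). A block-level checksum alone would report both cases as the same mismatch.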
Reliability Analysis of SSDs Under Power Fault