research-article

Reliability Analysis of SSDs Under Power Fault

Published: 01 November 2016

Abstract

Modern storage technology (solid-state disks (SSDs), NoSQL databases, commoditized RAID hardware, etc.) brings new reliability challenges to the already-complicated storage stack. Among other things, the behavior of these new components during power faults—which happen relatively frequently in data centers—is an important yet mostly ignored issue in this dependability-critical area. Understanding how new storage components behave under power fault is the first step towards designing new robust storage systems.

In this article, we propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially designed hardware to inject power faults directly to devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test 17 commodity SSDs from six different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that 14 of the 17 tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
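
The failure-detection step can be pictured with a minimal sketch. This is not the authors' implementation; it only illustrates how self-describing, checksummed blocks let a checker tell some failure types apart after a power fault. The block size, sector size, record layout, function names, and the label `write_not_persisted` are all assumptions made for this example.

```python
import struct
import zlib

BLOCK_SIZE = 4096    # assumed size of one written block
SECTOR_SIZE = 512    # assumed granularity at which a write may be torn
HEADER = struct.Struct("<QQI")  # sequence number, block address, CRC32 of payload


def make_record(seq: int, addr: int) -> bytes:
    """Build a self-describing block whose payload is derived from (seq, addr)."""
    payload_len = BLOCK_SIZE - HEADER.size
    seed = struct.pack("<QQ", seq, addr)
    payload = (seed * (payload_len // len(seed) + 1))[:payload_len]
    return HEADER.pack(seq, addr, zlib.crc32(payload)) + payload


def checksum_ok(block: bytes) -> bool:
    """Check the stored CRC32 against the payload, without any reference copy."""
    _seq, _addr, crc = HEADER.unpack_from(block)
    return zlib.crc32(block[HEADER.size:]) == crc


def classify(observed: bytes, expected_new: bytes, expected_old: bytes) -> str:
    """Compare a block read back after a fault with its new and previous contents."""
    if observed == expected_new:
        return "ok"
    if observed == expected_old:
        return "write_not_persisted"
    offsets = range(0, BLOCK_SIZE, SECTOR_SIZE)
    # A shorn write is a sector-wise mix of the new and the old block.
    if all(
        observed[i:i + SECTOR_SIZE]
        in (expected_new[i:i + SECTOR_SIZE], expected_old[i:i + SECTOR_SIZE])
        for i in offsets
    ):
        return "shorn_write"
    return "bit_corruption"
```

In this sketch a checker would regenerate the expected old and new records from its own log of issued writes, read the block back after power is restored, and call `classify`. Detecting ordering violations such as unserializable writes would additionally require comparing sequence numbers across blocks, which is left out here.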

      • Published in

        ACM Transactions on Computer Systems  Volume 34, Issue 4
        January 2017
        93 pages
        ISSN:0734-2071
        EISSN:1557-7333
        DOI:10.1145/3014162
        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 November 2016
        • Accepted: 1 August 2016
        • Revised: 1 May 2016
        • Received: 1 January 2015
        Qualifiers

        • research-article
        • Research
        • Refereed
