research-article

Reparo: A Fast RAID Recovery Scheme for Ultra-large SSDs

Published: 16 August 2021

Abstract

Recent ultra-large SSDs (e.g., a 32-TB SSD) provide many benefits for building cost-efficient enterprise storage systems. Owing to their large capacity, however, when such an SSD fails in a RAID storage system, a long rebuild is inevitable because RAID reconstruction requires copying a huge amount of data among SSDs. Motivated by modern SSD failure characteristics, we propose a new recovery scheme, called Reparo, for RAID storage systems with ultra-large SSDs. Unlike existing RAID recovery schemes, Reparo repairs a failed SSD at NAND die granularity without replacing it with a new SSD, thus avoiding most of the inter-SSD data copies during RAID recovery. When a NAND die of an SSD fails, Reparo exploits the multi-core processor of the SSD controller to identify the failed LBAs on the failed NAND die and to recover their data. Furthermore, Reparo ensures no negative post-recovery impact on the performance and lifetime of the repaired SSD. Experimental results using 32-TB enterprise SSDs show that Reparo recovers from a NAND die failure about 57 times faster than the existing rebuild method, with little degradation in SSD performance and lifetime observed after recovery.
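
The two steps named in the abstract, identifying the LBAs that lived on a failed die and recovering only those LBAs, can be illustrated with a toy sketch. This is not the authors' implementation: the logical-to-physical map layout, the function names, and the single-parity (RAID-5-style) XOR reconstruction with a same-LBA stripe layout are all assumptions made for illustration only.

```python
from functools import reduce
from typing import Dict, List, Tuple

# Hypothetical physical address: (die index, page index within the die).
PhysAddr = Tuple[int, int]


def find_failed_lbas(l2p: Dict[int, PhysAddr], failed_die: int) -> List[int]:
    """Return the LBAs whose current physical location is on the failed die."""
    return sorted(lba for lba, (die, _page) in l2p.items() if die == failed_die)


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_lba(lba: int, peers: List[Dict[int, bytes]]) -> bytes:
    """Rebuild one block from the same LBA on every surviving RAID member.

    Assumes a simplified layout in which the stripe for an LBA sits at the
    same LBA on each peer, and the peers hold the remaining data and parity.
    """
    return xor_blocks([peer[lba] for peer in peers])


def repair_die(l2p: Dict[int, PhysAddr], failed_die: int,
               peers: List[Dict[int, bytes]]) -> Dict[int, bytes]:
    """Recover only the LBAs affected by the failed die, not the whole SSD."""
    return {lba: recover_lba(lba, peers)
            for lba in find_failed_lbas(l2p, failed_die)}
```

Under these assumptions, the amount of data fetched from peer SSDs scales with the number of LBAs on the failed die instead of with the full SSD capacity, which is the intuition behind avoiding most inter-SSD copies. The map scan is also easy to partition by LBA range, which is one plausible way a multi-core controller could speed it up.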

      • Published in

        ACM Transactions on Storage  Volume 17, Issue 3
        August 2021
        227 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/3477268
        • Editor:
        • Sam H. Noh

        Copyright © 2021 Association for Computing Machinery.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 August 2021
        • Accepted: 1 February 2021
        • Revised: 1 December 2020
        • Received: 1 August 2020
        Qualifiers

        • research-article
        • Refereed
