research-article

Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors

Published: 11 January 2023

Abstract

Large-scale storage systems employ erasure-coding redundancy schemes to protect against device failures. The adverse effect of latent sector errors on the Mean Time to Data Loss (MTTDL) and the Expected Annual Fraction of Data Loss (EAFDL) reliability metrics is evaluated. A theoretical model capturing the effect of latent errors and device failures is developed, and closed-form expressions for the metrics of interest are derived. The MTTDL and EAFDL of erasure-coded systems are obtained analytically for (i) the entire range of bit error rates; (ii) the symmetric, clustered, and declustered data placement schemes; and (iii) arbitrary device failure and rebuild time distributions under network rebuild bandwidth constraints. The range of error rates that deteriorate system reliability is derived analytically. For realistic values of sector error rates, the results obtained demonstrate that MTTDL degrades, whereas, for moderate erasure codes, EAFDL remains practically unaffected. It is demonstrated that, in the range of typical sector error rates and for very powerful erasure codes, EAFDL degrades as well. It is also shown that the declustered data placement scheme offers superior reliability.
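As a rough illustration of how a bit error rate translates into the sector error rates discussed above, the following sketch computes the probability that a sector is unreadable and the probability that a read of many sectors encounters at least one such sector. It is not the model developed in the article: it assumes independent bit errors and a 512-byte sector, and the function names are hypothetical.

```python
import math


def sector_error_probability(p_bit: float, sector_bytes: int = 512) -> float:
    """Probability that a sector contains at least one bit error.

    Assumes independent bit errors (an illustrative assumption):
    P_s = 1 - (1 - p_bit)^S, where S is the number of bits per sector.
    """
    if not 0.0 <= p_bit < 1.0:
        raise ValueError("p_bit must be in [0, 1)")
    bits = 8 * sector_bytes
    # log1p/expm1 keep precision when p_bit is tiny (e.g. 1e-15).
    return -math.expm1(bits * math.log1p(-p_bit))


def read_failure_probability(p_sector: float, sectors_read: int) -> float:
    """Probability that reading `sectors_read` sectors hits at least one error.

    Assumes sector errors are independent of each other.
    """
    if not 0.0 <= p_sector < 1.0:
        raise ValueError("p_sector must be in [0, 1)")
    return -math.expm1(sectors_read * math.log1p(-p_sector))


if __name__ == "__main__":
    p_s = sector_error_probability(1e-15)
    # Sectors read when scanning an (assumed) 10 TB device end to end.
    sectors = 10 * 10**12 // 512
    print(f"P_s = {p_s:.3e}")
    print(f"P(at least one error in full scan) = {read_failure_probability(p_s, sectors):.3e}")
```

For small bit error rates the sector error probability is approximately the bit error rate multiplied by the number of bits per sector, which is why even very low bit error rates become relevant when an entire device must be read during a rebuild.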

Published in

ACM Transactions on Storage, Volume 19, Issue 1
February 2023
259 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3578369

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 11 January 2023
• Online AM: 19 November 2022
• Accepted: 28 June 2022
• Revised: 15 April 2022
• Received: 2 December 2021

Qualifiers

• research-article
• Refereed