Abstract
Large-scale storage systems employ erasure-coding redundancy schemes to protect against device failures. The adverse effect of latent sector errors on the Mean Time to Data Loss (MTTDL) and the Expected Annual Fraction of Data Loss (EAFDL) reliability metrics is evaluated. A theoretical model capturing the effect of latent errors and device failures is developed, and closed-form expressions for the metrics of interest are derived. The MTTDL and EAFDL of erasure-coded systems are obtained analytically for (i) the entire range of bit error rates; (ii) the symmetric, clustered, and declustered data placement schemes; and (iii) arbitrary device failure and rebuild time distributions under network rebuild bandwidth constraints. The range of error rates that deteriorate system reliability is derived analytically. For realistic values of sector error rates, the results obtained demonstrate that MTTDL degrades, whereas, for moderate erasure codes, EAFDL remains practically unaffected. It is demonstrated that, in the range of typical sector error rates and for very powerful erasure codes, EAFDL degrades as well. It is also shown that the declustered data placement scheme offers superior reliability.
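As a rough illustration of why latent sector errors matter during a rebuild, the following minimal sketch converts a bit error rate into a per-sector error probability and then into the probability that reading many sectors hits at least one unreadable sector. It is not the model derived in the paper. It assumes independent bit errors, and the function names, sector size, and amount of data read are hypothetical values chosen only for the example.

```python
import math


def sector_error_probability(bit_error_rate: float, sector_bits: int) -> float:
    """Probability that a sector contains at least one bit error.

    Assumption (not from the paper): bit errors are independent, so
    P_s = 1 - (1 - P_bit)^S for a sector of S bits.
    """
    if not 0.0 <= bit_error_rate < 1.0:
        raise ValueError("bit_error_rate must be in [0, 1)")
    # log1p/expm1 keep precision when the bit error rate is tiny.
    return -math.expm1(sector_bits * math.log1p(-bit_error_rate))


def read_failure_probability(p_sector: float, num_sectors: int) -> float:
    """Probability that at least one of num_sectors sectors is in error,
    again assuming independence between sectors."""
    if not 0.0 <= p_sector < 1.0:
        raise ValueError("p_sector must be in [0, 1)")
    return -math.expm1(num_sectors * math.log1p(-p_sector))


if __name__ == "__main__":
    # Hypothetical parameters: 4 KiB sectors, 10 TB read during a rebuild.
    sector_bits = 4096 * 8
    num_sectors = 10 * 10**12 // 4096
    for p_bit in (1e-16, 1e-15, 1e-14):
        p_s = sector_error_probability(p_bit, sector_bits)
        p_fail = read_failure_probability(p_s, num_sectors)
        print(f"P_bit={p_bit:.0e}  P_sector={p_s:.3e}  P(read hits an error)={p_fail:.3e}")
```

Under these assumptions, the chance of hitting at least one unreadable sector grows with both the error rate and the amount of data read during a rebuild, which is the mechanism through which latent errors can turn a recoverable device failure into data loss.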
Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors