Abstract
Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, focusing specifically on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from six disk models for periods of up to 5 years. We show how observed disk failures weaken the protection provided by RAID, and we find that the count of reallocated sectors correlates strongly with impending failures.
Based on these findings, we designed RAIDShield, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which account for 80% of all RAID failures. Second, we have designed and simulated a method that uses the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures; it can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
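The idea of a group-level joint failure probability can be illustrated with a minimal sketch. It is not the authors' method: it assumes each disk already has an estimated failure probability (how that estimate is derived, for example from reallocated sector counts, is left open here), it assumes disks fail independently, and the function names and the 0.01 alert threshold are hypothetical. Under those assumptions, the chance that a RAID-6 group sees three or more failures follows from the distribution of the number of failed disks.

```python
from typing import Sequence


def prob_at_least_k_failures(p_fail: Sequence[float], k: int) -> float:
    """Probability that at least k disks fail.

    Assumption (not from the source): disks fail independently, each with
    its own probability p_fail[i] over the prediction window.
    """
    # dist[j] = probability that exactly j of the disks processed so far fail
    dist = [1.0]
    for p in p_fail:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this disk survives
            nxt[j + 1] += q * p       # this disk fails
        dist = nxt
    return sum(dist[k:])


def raid6_group_at_risk(p_fail: Sequence[float], threshold: float = 0.01) -> bool:
    """Flag a RAID-6 group whose chance of a triple failure reaches the threshold.

    RAID-6 tolerates two concurrent disk failures, so a third loses data.
    The threshold value is a hypothetical placeholder.
    """
    return prob_at_least_k_failures(p_fail, 3) >= threshold


if __name__ == "__main__":
    # Hypothetical group of 14 disks, each with a modest individual risk.
    group = [0.08] * 14
    print(prob_at_least_k_failures(group, 3), raid6_group_at_risk(group))
```

A group of disks with moderate individual risk can cross the group-level threshold even though no single disk would trigger a per-disk alarm, which is the situation the second RAIDShield component targets. Correlated failures, which the independence assumption ignores, would make the true group risk higher than this estimate.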
We conclude with discussions of operational considerations in deploying RAIDShield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.