RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Published: 20 November 2015

Abstract

Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from six disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures.

Based on these findings, we designed RAIDShield, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which account for 80% of all RAID failures. Second, we have designed and simulated a method that uses the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures. This method can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
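To illustrate the idea of a joint failure probability, the following is a minimal sketch rather than the method used in the article. It assumes each disk already has an estimated failure probability (for example, derived from its reallocated sector count) and that disks fail independently; both are simplifying assumptions. The function name `prob_at_least_k_failures` and all probability values are hypothetical.

```python
from typing import Sequence


def prob_at_least_k_failures(disk_probs: Sequence[float], k: int) -> float:
    """Probability that at least k disks in a group fail.

    Assumes independent failures with per-disk probabilities disk_probs
    (an illustrative simplification, not the article's model).
    """
    # dist[j] holds the probability that exactly j of the disks
    # processed so far have failed.
    dist = [1.0]
    for p in disk_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this disk survives
            nxt[j + 1] += q * p       # this disk fails
        dist = nxt
    return sum(dist[k:])


# RAID-6 tolerates two failures, so three concurrent failures lose data.
# Hypothetical 14-disk groups: in both, every disk is below an
# (assumed) individual replacement threshold of 0.25.
healthy_group = [0.01] * 14
risky_group = [0.01] * 11 + [0.20] * 3

healthy_risk = prob_at_least_k_failures(healthy_group, 3)
risky_risk = prob_at_least_k_failures(risky_group, 3)
```

In this sketch no disk in `risky_group` would be flagged on its own, yet the group as a whole is far more likely to suffer a triple failure than `healthy_group`, which is the situation that group-level analysis is meant to catch.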

We conclude with discussions of operational considerations in deploying RAIDShield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.
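As a rough illustration of combining indicators, the sketch below flags a disk when either metric crosses a per-model threshold. It is not the article's predictor: the model names, the threshold values, and the simple "either metric" rule are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    reallocated_sectors: int
    medium_errors: int


# Hypothetical per-model thresholds; real values would have to be
# derived from field data for each disk model.
THRESHOLDS = {
    "model-A": Thresholds(reallocated_sectors=200, medium_errors=50),
    "model-B": Thresholds(reallocated_sectors=100, medium_errors=20),
}


def should_replace(model: str, reallocated_sectors: int, medium_errors: int) -> bool:
    """Flag a disk if either indicator reaches its model-specific threshold."""
    t = THRESHOLDS[model]
    return (
        reallocated_sectors >= t.reallocated_sectors
        or medium_errors >= t.medium_errors
    )
```

Keeping thresholds per model reflects the observation that the relative efficacy of the two metrics varies across disk models; a deployed system could weight the metrics differently instead of using a plain disjunction.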

Published in

ACM Transactions on Storage, Volume 11, Issue 4
Special Issue USENIX FAST 2015
November 2015
141 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2836327

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 20 November 2015
• Received: 1 August 2015
• Accepted: 1 August 2015

Qualifiers

• research-article
• Research
• Refereed