Beyond MTTDL: A Closed-Form RAID 6 Reliability Equation

Published: 01 March 2014
Abstract

We introduce a new closed-form equation for estimating the number of data-loss events for a redundant array of inexpensive disks in a RAID-6 configuration. The equation expresses operational failures, their restorations, latent (sector) defects, and disk media scrubbing by time-based distributions that can represent non-homogeneous Poisson processes. It uses two-parameter Weibull distributions, which allow the distributions to take on many different shapes and so model increasing, decreasing, or constant occurrence rates. This article focuses on the statistical basis of the equation. It also presents time-based distributions of the four processes based on an extensive analysis of field data collected over several years from tens of thousands of commercially available systems with hundreds of thousands of disk drives. Our results for RAID-6 groups of size 16 indicate that the closed-form expression yields much more accurate results than the MTTDL reliability equation and matches computationally intensive Monte Carlo simulations.
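
As a rough illustration of how a two-parameter Weibull distribution can represent increasing, decreasing, or constant occurrence rates, the sketch below evaluates the standard Weibull hazard function h(t) = (β/η)(t/η)^(β-1) for three shape values. The function names, the scale of 100,000 hours, and the shape values 0.8, 1.0, and 1.4 are assumptions chosen for illustration only; they are not parameters reported in the article, and this is not the article's closed-form equation.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard (instantaneous occurrence rate) at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_cdf(t, shape, scale):
    """Probability that the event has occurred by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def rate_trend(shape, scale, t_early=1_000.0, t_late=50_000.0):
    """Classify the occurrence rate by comparing an early and a late time."""
    early = weibull_hazard(t_early, shape, scale)
    late = weibull_hazard(t_late, shape, scale)
    if late > early:
        return "increasing"
    if late < early:
        return "decreasing"
    return "constant"


# Hypothetical scale parameter in hours (illustrative, not from the article).
SCALE_HOURS = 100_000.0

if __name__ == "__main__":
    # Hypothetical shape parameters: below, equal to, and above 1.
    for shape in (0.8, 1.0, 1.4):
        print(f"shape={shape}: {rate_trend(shape, SCALE_HOURS)} rate")
```

A shape parameter of exactly 1 reduces the Weibull to the exponential distribution, whose constant rate is the assumption underlying MTTDL-style calculations; shapes below or above 1 give rates that fall or rise with time.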

Published in

ACM Transactions on Storage, Volume 10, Issue 2
March 2014
86 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2600090
Editor: Darrell Long

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 March 2014
• Accepted: 1 July 2013
• Revised: 1 June 2013
• Received: 1 March 2013

Qualifiers

• research-article
• Research
• Refereed
