Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics

Published: 24 November 2008

Abstract

Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucially important for designing and building a reliable storage system. While several recent studies have been conducted on understanding storage failures, almost all of them focus on the failure characteristics of one component—disks—and do not study other storage component failures.

This article analyzes the failure characteristics of storage subsystems. More specifically, we analyzed the storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The dataset covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage-shelf enclosures. Our study reveals many interesting findings that provide useful guidelines for designing reliable storage systems. Some of our major findings include: (1) Disk failures contribute to 20–55% of storage subsystem failures, but other components, such as physical interconnects and protocol stacks, also account for a significant percentage of storage subsystem failures. (2) Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlations. In addition, these failures exhibit “bursty” patterns. (3) Storage subsystems configured with redundant interconnects experience 30–40% lower failure rates than those with a single interconnect. (4) Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than keeping them within a single shelf.

        • Published in

          ACM Transactions on Storage, Volume 4, Issue 3
          November 2008
          108 pages
          ISSN: 1553-3077
          EISSN: 1553-3093
          DOI: 10.1145/1416944

          Copyright © 2008 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 24 November 2008
          • Accepted: 1 August 2008
          • Received: 1 February 2008

          Qualifiers

          • research-article
          • Research
          • Refereed
