An analysis of data corruption in the storage stack

Published: 24 November 2008

Abstract

An important threat to reliable storage of data is silent data corruption. To develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this article, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches because they occur most often.

We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.
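
As an illustration only, the three classes of corruption named in the abstract can be sketched as three separate checks. The abstract does not describe how the studied systems implement these checks, so everything here is an assumption: CRC32 stands in for the checksum function, a stored block number stands in for the block's identity, and byte-wise XOR stands in for the parity scheme. The names `StoredBlock`, `write_block`, `classify_read`, `xor_parity`, and `parity_consistent` are hypothetical.

```python
import zlib
from dataclasses import dataclass
from functools import reduce
from typing import Sequence


@dataclass
class StoredBlock:
    """A data block plus the integrity metadata recorded when it was written."""
    data: bytes
    checksum: int   # CRC32 of the data (assumed checksum function)
    block_id: int   # logical block number (assumed form of "identity")


def write_block(block_id: int, data: bytes) -> StoredBlock:
    # Record the checksum and identity alongside the data at write time.
    return StoredBlock(data=data, checksum=zlib.crc32(data), block_id=block_id)


def classify_read(expected_id: int, block: StoredBlock) -> str:
    # Checksum mismatch: the data no longer matches its stored checksum.
    if zlib.crc32(block.data) != block.checksum:
        return "checksum mismatch"
    # Identity discrepancy: the block is internally consistent,
    # but it is not the block that was asked for.
    if block.block_id != expected_id:
        return "identity discrepancy"
    return "ok"


def xor_parity(blocks: Sequence[bytes]) -> bytes:
    # Byte-wise XOR across equal-length blocks (assumed parity scheme).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_consistent(data_blocks: Sequence[StoredBlock], parity: bytes) -> bool:
    # Parity inconsistency: recomputed parity differs from the stored parity.
    return xor_parity([b.data for b in data_blocks]) == parity
```

In this sketch, `classify_read` would run on every read, while `parity_consistent` would run over a whole stripe, for example during a background scan. Both choices are assumptions made for the example, not details taken from the article.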

Published in ACM Transactions on Storage, Volume 4, Issue 3, November 2008 (108 pages)
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/1416944
Copyright © 2008 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 February 2008
• Accepted: 1 August 2008
• Published: 24 November 2008

Qualifiers: research-article, Research, Refereed
