Abstract
An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this article, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches because they occur most frequently.
We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.
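The abstract names the three corruption classes without defining them. As a rough illustration only, the sketch below assumes simplified definitions: a checksum mismatch means a block's contents no longer match the checksum recorded at write time, an identity discrepancy means the block's recorded identity differs from the one the reader expected, and a parity inconsistency means the XOR of a stripe's data blocks differs from its stored parity. The `Block` layout, the function names, and the use of CRC-32 are assumptions made for the example, not the mechanisms of the storage systems studied.

```python
import zlib
from dataclasses import dataclass
from functools import reduce
from typing import List


@dataclass
class Block:
    """Hypothetical on-disk block: payload plus integrity metadata."""
    data: bytes
    checksum: int  # CRC-32 of data, recorded at write time (assumed scheme)
    identity: int  # logical block number recorded at write time (assumed)


def make_block(data: bytes, identity: int) -> Block:
    """Build a block whose metadata is consistent with its payload."""
    return Block(data, zlib.crc32(data), identity)


def has_checksum_mismatch(block: Block) -> bool:
    """True if the payload no longer matches the stored checksum."""
    return zlib.crc32(block.data) != block.checksum


def has_identity_discrepancy(block: Block, expected_identity: int) -> bool:
    """True if the block claims to be a different block than the one requested."""
    return block.identity != expected_identity


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR across equal-length blocks (single-parity stripe)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def has_parity_inconsistency(data_blocks: List[bytes], parity: bytes) -> bool:
    """True if recomputed parity differs from the stored parity block."""
    return xor_parity(data_blocks) != parity


if __name__ == "__main__":
    stripe = [make_block(b"aaaa", 0), make_block(b"bbbb", 1), make_block(b"cccc", 2)]
    parity = xor_parity([b.data for b in stripe])

    # Simulate silent corruption: payload changes, metadata does not.
    stripe[0].data = b"aaab"

    print(has_checksum_mismatch(stripe[0]))
    print(has_identity_discrepancy(stripe[1], expected_identity=1))
    print(has_parity_inconsistency([b.data for b in stripe], parity))
```

In this toy model the same corrupted payload trips both the checksum check and the parity check, while the identity check catches a different failure: a block that is internally valid but is not the block the reader asked for.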