Understanding latent sector errors and how to protect against them

Published: 28 September 2010

Abstract

Latent sector errors (LSEs) refer to the situation where particular sectors on a drive become inaccessible. LSEs are a critical factor in data reliability, since a single LSE can lead to data loss when encountered during RAID reconstruction after a disk failure or in systems without redundancy. LSEs happen at a significant rate in the field [Bairavasundaram et al. 2007], and are expected to grow more frequent with new drive technologies and increasing drive capacities. While two approaches, data scrubbing and intra-disk redundancy, have been proposed to reduce data loss due to LSEs, neither of these approaches has been evaluated on real field data.

This article makes two contributions. First, we provide an extended statistical analysis of latent sector errors in the field, specifically from the viewpoint of how to protect against LSEs. In addition to providing interesting insights into LSEs, we hope the results (including parameters for models we fit to the data) will help researchers and practitioners without access to data in driving their simulations or analysis of LSEs. Our second contribution is an evaluation of five different scrubbing policies and five different intra-disk redundancy schemes and their potential in protecting against LSEs. Our study includes schemes and policies that have been suggested before, but have never been evaluated on field data, as well as new policies that we propose based on our analysis of LSEs in the field.

References

  1. Bairavasundaram, L. N., Goodson, G. R., Pasupathy, S., and Schindler, J. 2007. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, New York, 289--300.
  2. Baker, M., Shah, M., Rosenthal, D. S. H., Roussopoulos, M., Maniatis, P., Giuli, T., and Bungale, P. 2006. A fresh look at the reliability of long-term digital storage. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems. ACM, New York, 221--234.
  3. Blaum, M., Brady, J., Bruck, J., and Menon, J. 1994. Evenodd: an optimal scheme for tolerating double disk failures in RAID architectures. In Proceedings of the 21st International Conference on Computer Architecture.
  4. Corbett, P., English, B., Goel, A., Grcanac, T., Kleiman, S., Leong, J., and Sankar, S. 2004. Row-diagonal parity for double disk failure correction. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies. USENIX Association, Monterey, CA, 1--14.
  5. Dholakia, A., Eleftheriou, E., Hu, X.-Y., Iliadis, I., Menon, J., and Rao, K. K. 2008. A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors. ACM Trans. Storage 4, 1. Article 1.
  6. Elerath, J. G. 2009. Hard-disk drives: the good, the bad, and the ugly. Comm. ACM 52, 6, 38--45.
  7. Elerath, J. G. and Pecht, M. 2007. Enhanced reliability modeling of RAID storage systems. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 175--184.
  8. Gunawi, H. S., Prabhakaran, V., Krishnan, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. 2007. Improving file system reliability with I/O shepherding. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP'07). ACM, New York, 283--296.
  9. Hafner, J. L. 2005. Weaver codes: Highly fault tolerant erasure codes for storage systems. In Proceedings of the 4th Conference on USENIX Conference on File and Storage Technologies. Vol. 4. ACM, New York.
  10. Hafner, J. L. 2006. Hover erasure codes for disk arrays. In International Conference on Dependable Systems and Networks. 217--226.
  11. Iliadis, I., Haas, R., Hu, X.-Y., and Eleftheriou, E. 2008. Disk scrubbing versus intra-disk redundancy for high-reliability RAID storage systems. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, New York, 241--252.
  12. McKusick, M. K., Joy, W. N., Leffler, S. J., and Fabry, R. S. 1984. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3, 181--197.
  13. Mi, N., Riska, A., Smirni, E., and Riedel, E. 2008. Enhancing data availability in disk drives through background activities. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
  14. Oprea, A. and Juels, A. 2010. A clean-slate look at disk scrubbing. In Proceedings of FAST'10.
  15. Patterson, D., Gibson, G., and Katz, R. 1988. A case for redundant arrays of inexpensive disks (RAID). ACM SIGMOD Record 17, 3, 109--116.
  16. Plank, J. S. 2008. The Raid-6 Liber8Tion Code. Int. J. High Perform. Comput. Appl. 23, 3, 242--251.
  17. Prabhakaran, V., Bairavasundaram, L. N., Agrawal, N., Gunawi, H. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. 2005. IRON file systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles. ACM, New York, 206--220.
  18. Schwarz, T., Xin, Q., Miller, E. L., Long, D. E., Hospodor, A., and Ng, S. 2004. Disk scrubbing in large archival storage systems. In Proceedings of the IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS'04). IEEE, Los Alamitos, CA.
  19. Wylie, J. J. and Swaminathan, R. 2007. Determining fault tolerance of XOR-based erasure codes efficiently. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
  20. Zhang, Y., Rajimwale, A., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. 2010. End-to-end data integrity for file systems: A ZFS case study. In Proceedings of the 8th Conference on File and Storage Technologies (FAST'10). To appear.
