
Disk Scrubbing Versus Intradisk Redundancy for RAID Storage Systems

Published: 01 July 2011

Abstract

Two schemes proposed to cope with unrecoverable or latent media errors and enhance the reliability of RAID systems are examined. The first scheme is the established, widely used, disk scrubbing scheme, which operates by periodically accessing disk drives to detect media-related unrecoverable errors. These errors are subsequently corrected by rebuilding the sectors affected. The second scheme is the recently proposed intradisk redundancy scheme, which uses a further level of redundancy inside each disk, in addition to the RAID redundancy across multiple disks. A new model is developed to evaluate the extent to which disk scrubbing reduces the unrecoverable sector errors. The probability of encountering unrecoverable sector errors is derived analytically under very general conditions regarding the characteristics of the read/write process of uniformly distributed random workloads and for a broad spectrum of disk scrubbing schemes, which includes the deterministic and random scrubbing schemes. We show that the deterministic scrubbing scheme is the most efficient one. We also derive closed-form expressions for the percentage of unrecoverable sector errors that the scrubbing scheme detects and corrects, the throughput performance, and the minimum scrubbing period achievable under operation with random, uniformly distributed I/O requests. Our results demonstrate that the reliability improvement due to disk scrubbing depends on the scrubbing frequency and the load of the system, and, for heavy-write workloads, may not reach the reliability level achieved by a simple interleaved parity-check (IPC)-based intradisk redundancy scheme, which is insensitive to the load. In fact, for small unrecoverable sector error probabilities, the IPC-based intradisk redundancy scheme achieves essentially the same reliability as that of a system operating without unrecoverable sector errors. 
For heavy loads, the reliability achieved by the scrubbing scheme can be orders of magnitude less than that of the intradisk redundancy scheme. Finally, the I/O and throughput performances are evaluated by means of analysis and event-driven simulation.
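The claim that deterministic scrubbing is the most efficient scheme can be made intuitive with a deliberately simplified sketch. It is not the model developed in the article: it ignores the read/write workload entirely and only measures how long a latent error, appearing at a uniformly random instant, waits for the next scrub of its sector. The function names, the exponential choice for the random scheme, and all parameter values are assumptions made for illustration.

```python
import bisect
import random


def deterministic_scrub_times(period, horizon):
    """Scrub instants k * period for k = 1, 2, ... up to the horizon."""
    count = int(horizon // period)
    return [k * period for k in range(1, count + 1)]


def random_scrub_times(mean_period, horizon, rng):
    """Scrub instants with exponentially distributed gaps (assumed random scheme)."""
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(1.0 / mean_period)
        if t > horizon:
            return times
        times.append(t)


def mean_detection_delay(scrub_times, trials, rng):
    """Average wait from a uniformly placed error to the next scrub instant."""
    last = scrub_times[-1]
    total = 0.0
    for _ in range(trials):
        error_time = rng.uniform(0.0, last)
        # Index of the first scrub at or after the error instant.
        idx = bisect.bisect_left(scrub_times, error_time)
        total += scrub_times[idx] - error_time
    return total / trials


if __name__ == "__main__":
    rng = random.Random(2011)
    period, horizon, trials = 1.0, 20000.0, 50000
    det = mean_detection_delay(deterministic_scrub_times(period, horizon), trials, rng)
    rnd = mean_detection_delay(random_scrub_times(period, horizon, rng), trials, rng)
    print(f"deterministic: {det:.3f} periods, random: {rnd:.3f} periods")
```

With the same mean scrubbing period, the periodic schedule leaves an error undetected for about half a period on average, whereas exponentially spaced scrubs leave it undetected for about a full period, because a random instant is more likely to fall inside a long gap. Under these assumptions the deterministic schedule exposes data to latent errors for roughly half as long.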

