Abstract
We introduce a new closed-form equation for estimating the number of data-loss events for a redundant array of inexpensive disks in a RAID-6 configuration. The equation expresses operational failures, their restorations, latent (sector) defects, and disk media scrubbing by time-based distributions that can represent non-homogeneous Poisson processes. It uses two-parameter Weibull distributions, which allow the distributions to take on many different shapes, modeling increasing, decreasing, or constant occurrence rates. This article focuses on the statistical basis of the equation. It also presents time-based distributions of the four processes based on an extensive analysis of field data collected over several years from tens of thousands of commercially available systems with hundreds of thousands of disk drives. Our results for RAID-6 groups of size 16 indicate that the closed-form expression yields much more accurate results than the MTTDL reliability equation and matches computationally intensive Monte Carlo simulations.
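The role of the Weibull shape parameter can be illustrated with a minimal sketch. It uses the standard two-parameter Weibull hazard rate, h(t) = (β/η)(t/η)^(β−1), where β is the shape and η the scale. The function names and the numeric values below are illustrative assumptions, not parameters or code from the article, and the sketch does not reproduce the closed-form RAID-6 equation itself.

```python
import math


def weibull_hazard(t, shape, scale):
    """Standard two-parameter Weibull hazard rate at time t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_cdf(t, shape, scale):
    """Probability that the event has occurred by time t (t >= 0)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def rate_trend(shape):
    """Classify the occurrence-rate trend implied by the shape parameter."""
    if shape < 1:
        return "decreasing"
    if shape > 1:
        return "increasing"
    return "constant"


if __name__ == "__main__":
    scale = 1000.0  # hypothetical characteristic life, in hours
    for shape in (0.5, 1.0, 2.0):  # hypothetical shape values
        early = weibull_hazard(10.0, shape, scale)
        late = weibull_hazard(100.0, shape, scale)
        print(f"shape={shape}: h(10)={early:.6f}, h(100)={late:.6f}, "
              f"trend={rate_trend(shape)}")
```

With a shape of 1 the hazard rate is constant and the distribution reduces to the exponential; shapes below or above 1 give decreasing or increasing rates, respectively.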
Beyond MTTDL: A Closed-Form RAID 6 Reliability Equation