Abstract
As storage systems grow in size and complexity, they increasingly face concurrent disk failures combined with multiple unrecoverable sector errors. Ensuring high data reliability and availability therefore requires erasure codes with high fault tolerance. In this article, we present a new family of such codes, named GRID codes, so called because they are strip-based codes whose strips are arranged into multi-dimensional grids. To construct GRID codes, we first introduce the concept of matched codes and then show how matched codes are used to build GRID codes. We also propose an iterative reconstruction algorithm for GRID codes and discuss some of their important features. Finally, we compare GRID codes with several categories of existing codes. The comparisons show that, for large-scale storage systems, GRID codes have attractive advantages over many existing erasure codes: (a) they are completely XOR-based and have very regular structures, which makes them easy to implement; (b) they can provide a fault tolerance of 15 or higher; and (c) their storage efficiency can reach 80% or higher. Together, these advantages make GRID codes particularly suitable for large-scale storage systems.
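To make the grid arrangement and the iterative reconstruction idea concrete, the following is a minimal sketch, not the construction from the article. It assumes the simplest possible stand-in for the matched codes: a single XOR parity strip in each of two dimensions. Each strip is modelled as a Python integer, and the names `encode` and `reconstruct` are hypothetical.

```python
from functools import reduce
from operator import xor


def encode(data):
    """Append one XOR parity strip to every row and every column.

    `data` is a list of equal-length rows of ints; each int stands in
    for one strip. (Assumption: single parity per dimension.)
    """
    rows = [row + [reduce(xor, row)] for row in data]
    rows.append([reduce(xor, col) for col in zip(*rows)])
    return rows


def reconstruct(grid):
    """Iteratively repair erased strips (marked None) in place.

    Every row and column of an encoded grid XORs to zero, so a line
    with exactly one erasure is repaired by XOR-ing its survivors.
    Passes repeat until no further strip can be repaired.
    Returns True if every strip was recovered.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    lines = [[(r, c) for c in range(n_cols)] for r in range(n_rows)]
    lines += [[(r, c) for r in range(n_rows)] for c in range(n_cols)]
    progress = True
    while progress:
        progress = False
        for line in lines:
            missing = [(r, c) for r, c in line if grid[r][c] is None]
            if len(missing) == 1:
                r, c = missing[0]
                survivors = (grid[i][j] for i, j in line if (i, j) != (r, c))
                grid[r][c] = reduce(xor, survivors)
                progress = True
    return all(v is not None for row in grid for v in row)
```

With single parity in both dimensions, any three erased strips can be repaired this way, whereas four erasures at the corners of a rectangle leave two erasures in every affected row and column, so the iteration stalls. Reaching the higher fault tolerance described in the abstract would require stronger component codes than this sketch uses.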
GRID codes: Strip-based erasure codes with high fault tolerance for storage systems