research-article

GRID codes: Strip-based erasure codes with high fault tolerance for storage systems

Published: 09 February 2009

Abstract

As storage systems grow in size and complexity, they increasingly face concurrent disk failures combined with multiple unrecoverable sector errors. Ensuring high data reliability and availability therefore requires erasure codes with high fault tolerance. In this article, we present a new family of such codes, named GRID codes. The name reflects their structure: they are strip-based codes whose strips are arranged into multi-dimensional grids. To construct GRID codes, we first introduce the concept of matched codes and then discuss how to build GRID codes from them. We also propose an iterative reconstruction algorithm for GRID codes and discuss some of their important features. Finally, we compare GRID codes with several categories of existing codes. The comparisons show that, for large-scale storage systems, GRID codes have attractive advantages over many existing erasure codes: (a) they are completely XOR-based and have very regular structures, which makes them easy to implement; (b) they can provide fault tolerance of 15 or higher; and (c) their storage efficiency can reach 80% or higher. Together, these advantages make GRID codes more suitable for large-scale storage systems.
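To make the grid idea and the iterative reconstruction concrete, here is a minimal sketch in Python. It is not the authors' construction. It assumes the simplest possible two-dimensional case: each cell is a single integer standing in for a strip, and each row and each column is protected by one XOR parity cell. The names `encode` and `reconstruct` are hypothetical and chosen only for this illustration. The matched codes used in the article are not modeled here.

```python
from functools import reduce
from operator import xor


def encode(data):
    """Append one XOR parity cell to every row, then one XOR parity row.

    Assumption: a cell is a plain int standing in for a whole strip.
    After encoding, every row and every column XORs to zero.
    """
    rows = [row + [reduce(xor, row)] for row in data]
    rows.append([reduce(xor, col) for col in zip(*rows)])
    return rows


def reconstruct(grid):
    """Repair erased cells (marked None) in place, one cell at a time.

    A cell is repaired when it is the only erasure left in its row or
    in its column. Passes repeat until no further cell can be repaired.
    Returns True if every cell was recovered.
    """
    progress = True
    while progress:
        progress = False
        for r in range(len(grid)):
            for c in range(len(grid[0])):
                if grid[r][c] is not None:
                    continue
                row_rest = [v for j, v in enumerate(grid[r]) if j != c]
                col_rest = [grid[i][c] for i in range(len(grid)) if i != r]
                for rest in (row_rest, col_rest):
                    if None not in rest:
                        # The missing cell is the XOR of the others in the line.
                        grid[r][c] = reduce(xor, rest)
                        progress = True
                        break
    return all(v is not None for row in grid for v in row)


if __name__ == "__main__":
    grid = encode([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    for r, c in [(0, 0), (0, 1), (1, 0)]:
        grid[r][c] = None  # simulate three lost strips
    print(reconstruct(grid), grid)
```

In this toy layout, any three erasures can be repaired, while four erasures at the corners of a rectangle cannot, because every affected row and column then has two cells missing. Using stronger codes per dimension, or more dimensions, is what would push the tolerance toward the figures quoted in the abstract. The price in this sketch is capacity: only 9 of the 16 cells hold data.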

Reviews

Festus Gail Gray

As the authors correctly suggest in the introduction, no code is perfect, and selecting a code for an application always involves tradeoffs. Although GRID codes, "a new family of erasure codes with high fault tolerance," may not be optimal with respect to any single parameter, they may be the best codes for storage applications. This is because their combination of less-than-optimal parameters is better than that of any other class of codes. The paper includes examples that show fault tolerance up to 15, efficiency up to 80 percent, very regular architectures, completely XOR-based operations, optimal small-write performance, and low reconstruction cost using local sets of data. Other codes exist that are better for any single feature. For example, Reed-Solomon codes have higher fault tolerance and optimal efficiency, but require more complex operations to detect and correct errors; this results in lower bandwidth for data transmission. Therefore, for storage applications, the higher bandwidth might be better than higher fault tolerance and higher efficiency. The paper is very well written and accessible to nonspecialists. It compares and contrasts various classes of codes. It is an excellent paper for readers who want to learn the advantages and disadvantages of using a variety of codes.

Online Computing Reviews Service