Abstract
Repair performance in hierarchical data centers is often bottlenecked by cross-rack network transfer. Recent theoretical results show that the cross-rack repair traffic can be minimized through repair layering, whose idea is to partition a repair operation into inner-rack and cross-rack layers. However, how repair layering should be implemented and deployed in practice remains an open issue. In this article, we address this issue by proposing a practical repair layering framework called DoubleR. We design two families of practical double regenerating codes (DRC), which not only minimize the cross-rack repair traffic but also have several practical properties that improve on state-of-the-art regenerating codes. We implement and deploy DoubleR atop the Hadoop Distributed File System (HDFS) and show that DoubleR maintains the theoretical guarantees of DRC and improves the repair performance of regenerating codes in both node recovery and degraded read operations.
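The idea of repair layering can be illustrated with a toy sketch. The code below is not the DRC construction and not DoubleR's implementation; it assumes a single XOR parity chunk, a hypothetical layout of chunks grouped by rack, and hypothetical helper names (`xor_chunks`, `direct_cross_rack`, `layered_repair`). It only shows why combining chunks inside each rack first (inner-rack layer) and then sending one partial result per rack (cross-rack layer) reduces cross-rack transfer compared with fetching every surviving chunk directly.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR a non-empty list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def direct_cross_rack(racks, failed_rack):
    """Chunks crossing racks if the repairing node fetches every survivor itself."""
    return sum(len(chunks) for r, chunks in enumerate(racks) if r != failed_rack)


def layered_repair(racks, failed_rack, failed_idx):
    """Rebuild one lost chunk; return (rebuilt_chunk, cross_rack_chunks_sent).

    Assumption: the stripe is protected by one XOR parity chunk, so the lost
    chunk equals the XOR of all surviving chunks.
    """
    cross_rack = 0
    partials = []
    for r, chunks in enumerate(racks):
        survivors = [c for i, c in enumerate(chunks)
                     if not (r == failed_rack and i == failed_idx)]
        if not survivors:
            continue
        # Inner-rack layer: combine the rack's surviving chunks locally.
        partials.append(xor_chunks(survivors))
        # Cross-rack layer: each remote rack ships a single partial result.
        if r != failed_rack:
            cross_rack += 1
    return xor_chunks(partials), cross_rack


# Toy stripe: five data chunks plus one XOR parity, two chunks per rack.
data = [bytes([i] * 4) for i in range(1, 6)]
parity = xor_chunks(data)
racks = [[data[0], data[1]], [data[2], data[3]], [data[4], parity]]

rebuilt, layered_cost = layered_repair(racks, failed_rack=0, failed_idx=0)
direct_cost = direct_cross_rack(racks, failed_rack=0)
```

In this toy layout the layered repair sends one chunk from each of the two remote racks, whereas a direct repair would pull all four remote chunks across racks. Regenerating codes reduce repair traffic further than this XOR example, which is only meant to show the two layers.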
Optimal Repair Layering for Erasure-Coded Data Centers: From Theory to Practice