Abstract
We propose repair pipelining, a technique that improves repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the normal read time of a single block in homogeneous environments. We further design extensions of the repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called
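The following Python sketch illustrates why pipelining small units brings repair time close to a single-block read. It rests on simplifying assumptions that are ours, not the paper's: the failed block is a linear combination of k helper blocks over GF(2^8) with arbitrary illustrative coefficients (a real Reed-Solomon decoder would derive them from the code's generator matrix), the helpers form a single chain ending at the requestor, and transfer time is modeled as size divided by a uniform link bandwidth, ignoring computation and propagation delay. All function names are hypothetical.

```python
"""Toy model of repair pipelining (an illustrative sketch, not the paper's code)."""
import os


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) using the polynomial x^8+x^4+x^3+x^2+1 (0x11D)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:  # reduce modulo the field polynomial
            a ^= 0x11D
        b >>= 1
    return product


def scale(coef: int, data: bytes) -> bytes:
    return bytes(gf_mul(coef, x) for x in data)


def xor(x: bytes, y: bytes) -> bytes:
    # Addition in GF(2^8) is bytewise XOR.
    return bytes(p ^ q for p, q in zip(x, y))


def direct_repair(coefs, blocks) -> bytes:
    """Conventional repair: the requestor downloads all k blocks and combines them."""
    result = bytes(len(blocks[0]))
    for c, blk in zip(coefs, blocks):
        result = xor(result, scale(c, blk))
    return result


def pipelined_repair(coefs, blocks, slice_size: int) -> bytes:
    """Chain-based repair: helper i adds coefs[i] * (its slice) to the partial
    slice received from helper i-1 and forwards it; the last helper delivers
    the finished slice to the requestor. This loop reproduces the data flow
    sequentially; the concurrency across hops is captured by repair_time()."""
    block_size = len(blocks[0])
    repaired = bytearray()
    for off in range(0, block_size, slice_size):
        partial = bytes(min(slice_size, block_size - off))
        for c, blk in zip(coefs, blocks):  # one hop per helper in the chain
            partial = xor(partial, scale(c, blk[off:off + slice_size]))
        repaired += partial
    return bytes(repaired)


def repair_time(k, block_size, bandwidth, num_slices=None):
    """Idealized transfer time. Conventional: the requestor's link carries k
    whole blocks. Pipelined: k hops each move one slice per timeslot, so the
    last of num_slices slices arrives after num_slices + k - 1 timeslots."""
    if num_slices is None:
        return k * block_size / bandwidth
    slot = (block_size / num_slices) / bandwidth
    return (num_slices + k - 1) * slot


if __name__ == "__main__":
    k = 6
    coefs = [1, 2, 3, 4, 5, 6]  # illustrative, not a real RS decoding vector
    helpers = [os.urandom(1000) for _ in range(k)]
    assert pipelined_repair(coefs, helpers, 64) == direct_repair(coefs, helpers)
    # Illustrative numbers: 64 MiB blocks, 128 MiB/s links.
    print(repair_time(k, 64, 128))                   # conventional
    print(repair_time(k, 64, 128, num_slices=2048))  # pipelined
```

With these illustrative numbers, the conventional model gives 3.0 s, while 2,048 slices give (2048 + 5)/2048 × 0.5 s, about 0.501 s, close to the 0.5 s needed to read one block. In this model the pipelined time is (1 + (k − 1)/s) times the single-block read time for s slices, so finer slicing shrinks the overhead, although a real system pays more per-slice overhead as slices get smaller.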
Repair Pipelining for Erasure-coded Storage: Algorithms and Evaluation