Abstract
Existing distributed asynchronous graph processing systems employ checkpointing to capture globally consistent snapshots and, to recover from machine failures, roll back all machines to the most recent checkpoint. In this paper, we argue that, because of the relaxed semantics of asynchronous execution, recovery in distributed asynchronous graph processing does not require rolling the entire execution state back to a globally consistent state. We define the properties that a recovered state must satisfy to be usable for correct asynchronous processing and develop CoRAL, a lightweight checkpointing and recovery algorithm. First, the algorithm carries out confined recovery, which rolls back the graph execution states of only the failed machines. Second, it relies on lightweight checkpoints that capture locally consistent snapshots with a reduced peak network bandwidth requirement. Our experiments using real-world graphs show that, when failures impact 1 to 6 machines of a 16-machine cluster, our technique recovers from failures and finishes processing 1.5x to 3.2x faster than the traditional asynchronous checkpointing and recovery mechanism. Moreover, capturing locally consistent snapshots significantly reduces the intermittent high peak bandwidth usage required to save the snapshots: the average reduction in 99th percentile bandwidth ranges from 22% to 51% while 1 to 6 snapshot replicas are being maintained.
CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems