Abstract
Many vertex-centric graph algorithms can be expressed using asynchronous parallelism by relaxing certain read-after-write data dependences and allowing threads to compute vertex values using stale (i.e., not the most recent) values of their neighboring vertices. We observe that, on distributed shared memory systems, converting synchronous algorithms into their asynchronous counterparts makes them tolerant of high inter-node communication latency. However, high inter-node communication latency can lead to excessive use of stale values, which increases the number of iterations the algorithms need to converge. Bounded staleness can limit this slowdown in the rate of convergence, but it also limits the ability to tolerate communication latency. In this paper we design a relaxed memory consistency model and a consistency protocol that simultaneously tolerate communication latency and minimize the use of stale values. This is achieved via the coordinated use of a best-effort refresh policy and bounded staleness. We demonstrate that, for a range of asynchronous graph algorithms and PDE solvers, our approach on average outperforms algorithms based upon prior relaxed memory models that allow stale values by at least 2.27x, and algorithms based upon the Bulk Synchronous Parallel (BSP) model by 4.2x. We also show that our approach frequently outperforms GraphLab, a popular distributed graph processing framework.
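To make the interplay of bounded staleness and best-effort refresh concrete, the following is a minimal single-process sketch in Python. It is not the paper's protocol: the class `StaleCache`, its fields, and the method `deliver_refreshes` are hypothetical names introduced only for illustration, the "remote" memory is a plain dictionary, and staleness is assumed to be measured in iterations. A read within the bound returns the cached (possibly stale) value and requests a background refresh; a read beyond the bound blocks and fetches the current value.

```python
from dataclasses import dataclass, field


@dataclass
class CachedValue:
    value: float
    fetched_at: int  # iteration at which this copy was fetched


@dataclass
class StaleCache:
    """Hypothetical per-node cache of remote vertex values (illustration only)."""
    home: dict   # stands in for the remote owner's memory
    bound: int   # assumed staleness bound, measured in iterations
    entries: dict = field(default_factory=dict)
    pending: set = field(default_factory=set)  # best-effort refreshes in flight
    blocking_fetches: int = 0

    def _fetch(self, key, now):
        self.entries[key] = CachedValue(self.home[key], now)

    def read(self, key, now):
        entry = self.entries.get(key)
        if entry is None or now - entry.fetched_at > self.bound:
            # Cold miss or bound exceeded: wait for the current value.
            self.blocking_fetches += 1
            self._fetch(key, now)
        elif now > entry.fetched_at:
            # Within the bound: use the cached copy, refresh in the background.
            self.pending.add(key)
        return self.entries[key].value

    def deliver_refreshes(self, now):
        """Simulate background refreshes arriving; return how many were applied."""
        delivered = len(self.pending)
        for key in self.pending:
            self._fetch(key, now)
        self.pending.clear()
        return delivered


home = {"v": 1.0}
cache = StaleCache(home=home, bound=2)

first = cache.read("v", now=0)    # cold miss, blocking fetch
home["v"] = 5.0                   # the owner updates the vertex
stale = cache.read("v", now=1)    # within the bound: stale value, refresh queued
cache.deliver_refreshes(now=1)    # the best-effort refresh arrives
fresh = cache.read("v", now=1)    # refreshed copy
home["v"] = 9.0
forced = cache.read("v", now=4)   # staleness 3 exceeds the bound: blocking fetch

print(first, stale, fresh, forced, cache.blocking_fetches)
```

In this sketch only the cold miss and the read that exceeds the bound pay the fetch latency. The read at iteration 1 proceeds with the stale value while the refresh is outstanding, which is the trade-off the abstract describes.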
ASPIRE: exploiting asynchronous parallelism in iterative algorithms using a relaxed consistency based DSM
OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications