Abstract
Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the strong desire on performance per core, dollar and joule has generated considerable interests of processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph-analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on them, this paper describes Polymer, a NUMA-aware graph-analytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, application-defined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices. A detailed evaluation on an 80-core machine shows that Polymer often outperforms the state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream and Galois, for a set of popular real-world and synthetic graphs.
- The 9th dimacs implementation challenge - shortest paths. http://www.dis.uniroma1.it/challenge9/.Google Scholar
- Graph 500. http://www.graph500.org.Google Scholar
- numactl. http://oss.sgi.com/projects/libnuma/.Google Scholar
- L. A. Adamic and B. A. Huberman. Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.Google Scholar
- A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new os architecture for scalable multicore systems. In SOSP, 2009. Google Scholar
Digital Library
- S. Beamer, K. Asanovi´c, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3):137–148, 2013. Google Scholar
Digital Library
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998. Google Scholar
Digital Library
- I. Calciu, D. Dice, Y. Lev, V. Luchangco, V. J. Marathe, and N. Shavit. Numa-aware reader-writer locks. In PPoPP, 2013. Google Scholar
Digital Library
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, volume 4, pages 442–446. SIAM, 2004.Google Scholar
- R. Chen and H. Chen. Tiled-mapreduce: Efficient and flexible mapreduce processing on multicore with tiling. ACM TACO, 10, 2013. Google Scholar
Digital Library
- R. Chen, H. Chen, and B. Zang. Tiled-mapreduce: optimizing resource usages of data-parallel applications on multicore with tiling. In PACT, 2010. Google Scholar
Digital Library
- R. Chen, J. Shi, Y. Chen, H. Guan, and H. Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report 2013-11-001, IPADS, SJTU, 2013.Google Scholar
- R. Chen, X. Ding, P. Wang, H. Chen, B. Zang, and H. Guan. Computation and communication efficient graph processing with distributed immutable view. In HPDC, 2014. Google Scholar
Digital Library
- R. Chen, J. Shi, B. Zang, and H. Guan. Bipartite-oriented distributed graph partitioning for big learning. In APSys, 2014. Google Scholar
Digital Library
- R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys, pages 85–98, 2012. Google Scholar
Digital Library
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al. Introduction to algorithms, volume 2. MIT press Cambridge, 2001. Google Scholar
Digital Library
- M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: a holistic approach to memory placement on numa systems. In ASPLOS, 2013. Google Scholar
Digital Library
- T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In SOSP, 2013. Google Scholar
Digital Library
- D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing numa locks. In PPoPP, pages 247–256, 2012. Google Scholar
Digital Library
- M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999. Google Scholar
Digital Library
- F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, V. Quema, and I. Grenoble. Large pages may be harmful on numa systems. In USENIX ATC, 2014. Google Scholar
Digital Library
- J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012. Google Scholar
Digital Library
- J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google Scholar
Digital Library
- W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and E. Chen. Chronos: a graph engine for temporal graph analysis. In EuroSys, 2014. Google Scholar
Digital Library
- U. Kang, D. Horng, et al. Inference of beliefs on billion-scale graphs. In SIGKDD-LDMTA, 2010.Google Scholar
- Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013. Google Scholar
Digital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010. Google Scholar
Digital Library
- A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012. Google Scholar
Digital Library
- R. Lachaize, B. Lepers, V. Quéma, et al. Memprof: A memory profiler for numa multicore systems. In USENIX ATC, 2012. Google Scholar
Digital Library
- X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on numa architectures. In PPoPP, 2014. Google Scholar
Digital Library
- Y. Liu, B. Wu, H. Wang, and P. Ma. Bpgm: A big graph mining tool. Tsinghua Science and Technology, 19(1):33–38, 2014.Google Scholar
Cross Ref
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endowment, 5(8):716–727, 2012. Google Scholar
Digital Library
- A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.Google Scholar
Cross Ref
- Z. Majo and T. R. Gross. Memory management in numa multicore systems: trapped between cache contention and interconnect overhead. In ISMM, 2011. Google Scholar
Digital Library
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google Scholar
Digital Library
- J. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21–65, 1991. Google Scholar
Digital Library
- U. Meyer and P. Sanders. δ -stepping: a parallelizable shortest path algorithm. Journal of Algorithms, 49(1):114–152, 2003. Google Scholar
Digital Library
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439–455. ACM, 2013. Google Scholar
Digital Library
- D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013. Google Scholar
Digital Library
- B. Panda, J. Herbach, S. Basu, and R. Bayardo. PLANET: massively parallel learning of tree ensembles with MapReduce. In VLDB, 2009. Google Scholar
Digital Library
- V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing large graphs on multi-cores with graph awareness. In Usenix ATC, 2012. Google Scholar
Digital Library
- A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In SOSP, 2013. Google Scholar
Digital Library
- J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In PPoPP, 2013. Google Scholar
Digital Library
- A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010. Google Scholar
Digital Library
- P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan. Replication-based fault-tolerance for large-scale graph processing. In DSN, 2014. Google Scholar
Digital Library
- C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In PPoPP, 2015. Google Scholar
Digital Library
- X. Zhao, A. Chang, A. D. Sarma, H. Zheng, and B. Y. Zhao. On the embeddability of random walk distances. In VLDB, 2013. Google Scholar
Digital Library
- J. Zhong and B. He. Medusa: Simplified Graph Processing on GPUs. TPDS, 2013. Google Scholar
Digital Library
- X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Technical Report CMUCALD-02-107, Carnegie Mellon University, 2002.Google Scholar
Index Terms
NUMA-aware graph-structured analytics
Recommendations
NUMA-aware graph-structured analytics
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingGraph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the ...
GraphGrind: addressing load imbalance of graph partitioning
ICS '17: Proceedings of the International Conference on SupercomputingWe investigate how graph partitioning adversely affects the performance of graph analytics. We demonstrate that graph partitioning induces extra work during graph traversal and that graph partitions have markedly different connectivity than the original ...






Comments