Abstract
Graph processing is an important research topic in the big-data era. A reasonable way to build a general graph-processing framework on a DRAM-based FPGA board with a deep memory hierarchy is to partition a large graph into many small subgraphs, represent the graph as a two-dimensional grid, and process the subgraphs one after another in a divide-and-conquer fashion. This method (grid-graph processing) stores the graph data in off-chip memory (e.g., on-board or host DRAM), which has large capacity but relatively low bandwidth, and processes each small subgraph in on-chip memory (e.g., FFs, BRAM, and URAM), which has small capacity but superior random-access performance. However, exchanging graph (vertex and edge) data directly between the processing units on the FPGA chip and slow off-chip DRAM during grid-graph processing limits performance and causes excessive data transfer between the FPGA chip and off-chip memory.
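As a rough illustration of the grid representation described above, the following Python sketch splits an edge list into a p × p grid of blocks. The function name `grid_partition`, the contiguous equal-sized vertex intervals, and the toy edge list are assumptions made for illustration; the abstract does not specify the partitioning details used in the article.

```python
from collections import defaultdict


def grid_partition(edges, num_vertices, p):
    """Split edges into a p x p grid of blocks (hypothetical helper).

    Vertices are divided into p contiguous intervals (an assumption);
    edge (src, dst) goes to block (interval of src, interval of dst).
    """
    interval = -(-num_vertices // p)  # ceiling division: vertices per interval
    blocks = defaultdict(list)
    for src, dst in edges:
        blocks[(src // interval, dst // interval)].append((src, dst))
    return blocks


# Toy graph with 8 vertices, partitioned into a 2 x 2 grid.
edges = [(0, 1), (0, 5), (2, 3), (4, 0), (5, 6), (7, 2), (6, 7)]
blocks = grid_partition(edges, num_vertices=8, p=2)
for i, j in sorted(blocks):
    print(f"block ({i},{j}): {blocks[(i, j)]}")
```

Each block can then be processed on its own: only the source and destination vertex intervals of that block need to be resident in fast memory while its edges are streamed in.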
In this article, we show that application-specific caching mechanisms, built by leveraging the flexibility and programmability of FPGAs, effectively improve the performance of grid-graph processing on DRAM-based FPGA accelerators. These mechanisms bridge the performance gap between on-chip and off-chip memory and reduce data transfer by exploiting locality in data accesses. We design two such mechanisms, vertex caching and edge caching, to exploit two types of locality in grid-graph processing, vertex locality and subgraph locality, respectively. Experimental results show that, with vertex caching, our system (FabGraph) achieves speedups of up to 3.1× for BFS and 2.5× for PageRank over ForeGraph when processing medium graphs stored in on-board DRAM. With edge caching, the extension of FabGraph (FabGraph+) achieves a speedup of up to 9.96× for BFS over FPGP when processing large graphs stored in host DRAM.
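To make the locality argument concrete, the sketch below counts how many vertex intervals must be fetched from off-chip memory when the blocks of a grid are processed in a fixed order, with and without a small on-chip cache. This is a toy model only: the LRU policy, the `count_interval_loads` helper, and the row-major schedule are assumptions for illustration and are not taken from FabGraph, whose design the abstract does not detail.

```python
from collections import OrderedDict


def count_interval_loads(block_order, cache_slots):
    """Count off-chip loads of vertex intervals for a block schedule.

    cache_slots == 0 models no caching: both intervals of every block
    are fetched again. Otherwise an LRU cache holding `cache_slots`
    intervals is simulated (LRU is an assumption of this sketch).
    """
    cache = OrderedDict()
    loads = 0
    for i, j in block_order:
        for interval in (i, j):  # source interval, then destination interval
            if interval in cache:
                cache.move_to_end(interval)  # mark as most recently used
                continue
            loads += 1  # miss: fetch the interval from off-chip memory
            if cache_slots > 0:
                cache[interval] = True
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict least recently used
    return loads


# Row-major schedule over a 4 x 4 grid of blocks.
p = 4
order = [(i, j) for i in range(p) for j in range(p)]
no_cache = count_interval_loads(order, cache_slots=0)
small_cache = count_interval_loads(order, cache_slots=2)
full_cache = count_interval_loads(order, cache_slots=p)
print(no_cache, small_cache, full_cache)
```

In this model, even a cache of two intervals removes many reloads because consecutive blocks in a row share their source interval; a cache large enough for all intervals loads each one exactly once.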
Index Terms
Processing Grid-format Real-world Graphs on DRAM-based FPGA Accelerators with Application-specific Caching Mechanisms