research-article

Processing Grid-format Real-world Graphs on DRAM-based FPGA Accelerators with Application-specific Caching Mechanisms

Published: 03 June 2020

Abstract

Graph processing is an important research topic in the big-data era. To build a general graph-processing framework on a DRAM-based FPGA board with a deep memory hierarchy, a reasonable approach is to partition a large graph into many small subgraphs, represent the graph as a two-dimensional grid, and process the subgraphs one after another in a divide-and-conquer manner. This approach (grid-graph processing) stores the graph data in off-chip memory (e.g., on-board or host DRAM), which has large capacity but relatively low bandwidth, and processes each small subgraph in on-chip memory (e.g., FFs, BRAM, and URAM), which has small capacity but superior random-access performance. However, exchanging graph (vertex and edge) data directly between the processing units in the FPGA chip and the slow off-chip DRAM during grid-graph processing limits performance and moves an excessive amount of data between the FPGA chip and off-chip memory.
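
As a rough illustration of the grid representation described above, the following Python sketch splits the vertices into `p` equal intervals, places each edge in the block indexed by its source and destination intervals, and runs a level-synchronous BFS that visits one block at a time. The function names (`build_grid`, `bfs_on_grid`), the equal-size interval rule, and the column-major block order are assumptions made for illustration only; this is a software model, not the hardware design of the article.

```python
from collections import defaultdict


def build_grid(edges, num_vertices, p):
    """Partition edges into a p x p grid of blocks (hypothetical helper).

    Vertices are split into p equal-size intervals; edge (src, dst) is
    stored in block (interval of src, interval of dst).
    """
    interval = -(-num_vertices // p)  # ceiling division
    grid = defaultdict(list)
    for src, dst in edges:
        grid[(src // interval, dst // interval)].append((src, dst))
    return grid, interval


def bfs_on_grid(edges, num_vertices, root, p):
    """Level-synchronous BFS that processes one grid block at a time."""
    grid, _ = build_grid(edges, num_vertices, p)
    inf = float("inf")
    level = [inf] * num_vertices
    level[root] = 0
    current = 0
    changed = True
    while changed:
        changed = False
        for j in range(p):        # destination interval (assumed order)
            for i in range(p):    # source interval
                for src, dst in grid.get((i, j), ()):
                    # Only vertices on the current frontier propagate.
                    if level[src] == current and level[dst] == inf:
                        level[dst] = current + 1
                        changed = True
        current += 1
    return level


if __name__ == "__main__":
    sample = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    print(bfs_on_grid(sample, num_vertices=6, root=0, p=2))
```

Each block touches only one source interval and one destination interval, which is what lets a small on-chip memory hold the vertex data needed for the block being processed.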

In this article, we show that the performance of grid-graph processing on DRAM-based FPGA accelerators can be improved effectively by leveraging the flexibility and programmability of FPGAs to build application-specific caching mechanisms. These mechanisms bridge the performance gap between on-chip and off-chip memory and reduce the amount of data transferred by exploiting locality in data accesses. We design two application-specific caching mechanisms, vertex caching and edge caching, which exploit two types of locality in grid-graph processing: vertex locality and subgraph locality, respectively. Experimental results show that, with the vertex caching mechanism, our system (named FabGraph) achieves speedups of up to 3.1× and 2.5× for BFS and PageRank, respectively, over ForeGraph when processing medium graphs stored in on-board DRAM. With the edge caching mechanism, the extension of FabGraph (named FabGraph+) achieves speedups of up to 9.96× for BFS over FPGP when processing large graphs stored in host DRAM.
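
To give a feel for why caching vertex data can cut off-chip traffic, the toy model below counts how many vertex intervals must be fetched from off-chip memory during one sweep over a `p × p` grid. It assumes that each block needs one source interval and one destination interval, that the two are stored separately, that blocks are visited column by column, and that on-chip storage behaves like an LRU cache with `cache_slots` entries. All of these are simplifying assumptions for illustration; the abstract does not specify FabGraph's caching policy, and this sketch does not reproduce it.

```python
from collections import OrderedDict


def count_interval_loads(p, cache_slots):
    """Count off-chip interval fetches for a column-major sweep of a p x p grid.

    Hypothetical model: block (i, j) needs source interval i and destination
    interval j; an LRU cache with `cache_slots` entries stands in for
    on-chip memory.
    """
    cache = OrderedDict()
    loads = 0
    for j in range(p):
        for i in range(p):
            for key in (("src", i), ("dst", j)):
                if key in cache:
                    cache.move_to_end(key)         # hit: refresh recency
                else:
                    loads += 1                     # miss: fetch from off-chip
                    cache[key] = True
                    if len(cache) > cache_slots:
                        cache.popitem(last=False)  # evict least recently used
    return loads


if __name__ == "__main__":
    for slots in (1, 2, 8):
        print(slots, count_interval_loads(4, slots))
```

In this model, a 4 × 4 grid needs 32 fetches with a single slot, 20 with two slots (the destination interval stays resident across a column), and 8 once all intervals fit on chip.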


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 3
September 2020
182 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3404107
Editor: Deming Chen

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 3 June 2020
• Online AM: 7 May 2020
• Accepted: 1 March 2020
• Revised: 1 February 2020
• Received: 1 November 2019

Qualifiers

• research-article
• Research
• Refereed
