Substream-Centric Maximum Matchings on FPGA

Published: 24 April 2020

Abstract

Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and other domains. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams that are processed independently to enable more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide a theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory usage, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
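To make the substream-centric idea concrete, the following is a minimal software sketch rather than the FPGA design itself. It assumes a weighted edge stream and a particular way of forming substreams that the abstract does not specify: substream `i` sees only edges whose weight is at least `(1 + eps) ** i`. Each substream runs a simple greedy maximal matching without consulting the others, and the partial results are merged from the heaviest substream down. The names `greedy_maximal_matching`, `substream_matching`, `eps`, and `num_substreams` are hypothetical and chosen for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight)


def greedy_maximal_matching(edges: Iterable[Edge]) -> List[Edge]:
    """Single pass over a stream: keep an edge if both endpoints are still free."""
    matched: Set[int] = set()
    matching: List[Edge] = []
    for u, v, w in edges:
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v, w))
    return matching


def substream_matching(stream: List[Edge], eps: float,
                       num_substreams: int) -> List[Edge]:
    """Sketch of substream-centric weighted matching (assumed partitioning)."""
    # Assumption: substream i only sees edges with weight >= (1 + eps) ** i.
    thresholds = [(1.0 + eps) ** i for i in range(num_substreams)]

    # Each substream is processed independently of the others; in hardware
    # these loops could run in parallel. Here they run one after another.
    partial = [
        greedy_maximal_matching(e for e in stream if e[2] >= t)
        for t in thresholds
    ]

    # Merge: walk substreams from the highest threshold to the lowest and
    # greedily add edges that do not conflict with those already chosen.
    matched: Set[int] = set()
    result: List[Edge] = []
    for matching in reversed(partial):
        for u, v, w in matching:
            if u not in matched and v not in matched:
                matched.update((u, v))
                result.append((u, v, w))
    return result


if __name__ == "__main__":
    stream = [(0, 1, 1.0), (1, 2, 4.0), (2, 3, 1.0)]
    print(greedy_maximal_matching(stream))
    print(substream_matching(stream, eps=1.0, num_substreams=3))
```

On the three-edge path in the usage example, plain greedy matching keeps the two light edges it meets first, whereas the thresholded substreams let the heavy middle edge win during the merge. The substreams share no state until the final merge, which is the property the abstract associates with more parallelism and lower communication costs.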

References

  1. 10th DIMACS Challenge. 2011. Kronecker Generator Graphs.Google ScholarGoogle Scholar
  2. C. Aggarwal and K. Subbian. 2014. Evolutionary network analysis: A survey. CSUR.Google ScholarGoogle Scholar
  3. G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl. 2004. On the streaming model augmented with a sorting primitive. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, 2004. IEEE, pp. 540--549.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. Agron, W. Peck, E. Anderson, D. Andrews, E. Komp, R. Sass, F. Baijot, and J. Stevens. 2006. Run-time services for hybrid CPU/FPGA systems on chip. In Proceedings of the 2006 27th IEEE International Real-Time Systems Symposium (RTSS’06). IEEE, pp. 3--12.Google ScholarGoogle Scholar
  5. J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. 2016. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News 43, 3 (2016), 105--117.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. K. J. Ahn and S. Guha. 2011. Linear programming in the semi-streaming model with application to the maximum matching problem. In ICALP.Google ScholarGoogle Scholar
  7. K. J. Ahn, S. Guha, and A. McGregor. 2012. Analyzing graph structure via linear measurements. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 459--467.Google ScholarGoogle Scholar
  8. K. J. Ahn, S. Guha, and A. McGregor. 2012. Graph sketches: sparsification, spanners, and subgraphs. In PODS.Google ScholarGoogle Scholar
  9. D. Andrews, D. Niehaus, and P. Ashenden. 2004. Programming models for hybrid CPU/FPGA chips. Computer 37, 1 (2004), 118--120.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. Andrews, D. Niehaus, R. Jidin, M. Finley, W. Peck, M. Frisbie, J. Ortiz, E. Komp, and P. Ashenden. 2004. Programming models for hybrid FPGA-CPU computational components: A missing link. IEEE Micro 24, 4 (2004), 42--53.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Arora, E. Hazan, and S. Kale. 2012. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing 8, 1 (2012), 121--164.Google ScholarGoogle ScholarCross RefCross Ref
  12. S. Assadi, S. Khanna, Y. Li, and G. Yaroslavtsev. 2016. Maximum matchings in dynamic graph streams and the simultaneous communication model. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 1345--1364.Google ScholarGoogle Scholar
  13. O. G. Attia, T. Johnson, K. Townsend, P. Jones, and J. Zambreno. 2014. Cygraph: A reconfigurable architecture for parallel breadth-first search. In Proceedings of the 2014 IEEE International Parallel 8 Distributed Processing Symposium Workshops (IPDPSW). IEEE, pp. 228--235.Google ScholarGoogle Scholar
  14. A. Bar-Noy, R. Bar-Yehuda, A. Freund, J. Naor, and B. Schieber. 2001. A unified approach to approximating resource allocation and scheduling. Journal of the ACM (JACM) 48, 5 (2001), 1069--1090.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Bar-Yehuda, K. Bendel, A. Freund, and D. Rawitz. 2004. Local ratio: A unified framework for approximation algorithms. In memoriam: Shimon Even 1935-2004. ACM Computing Surveys (CSUR) 36, 4 (2004), 422--463.Google ScholarGoogle Scholar
  16. R. Bar-Yehuda and S. Even. 1985. A local-ratio theorem for approximating the weighted vertex cover problem. North-Holland Mathematics Studies 109, (1985), 27--45.Google ScholarGoogle Scholar
  17. T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler. 2019. A modular benchmarking infrastructure for high-performance and reproducible deep learning. arXiv preprint arXiv:1901.10183.Google ScholarGoogle Scholar
  18. Maciej Besta, Simon Weber, Lukas Gianinazzi, Robert Gerstenberger, Andrey Ivanov, Yishai Oltchik, and Torsten Hoefler. 2019. Slim graph: Practical lossy graph compression for approximate graph processing, storage, and analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–25.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Besta, M. Fischer, T. Ben-Nun, J. D. F. Licht, and T. Hoefler. 2019. Substream-centric maximum matchings on FPGA. Feb. 2019. In Proceedings of the 27th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Feb. 2019).Google ScholarGoogle Scholar
  20. M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler. 2019. Practice of streaming and dynamic graphs: Concepts, models, systems, and parallelism. arXiv preprint arXiv:1912.12740.Google ScholarGoogle Scholar
  21. M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler. 2018. Slim noc: A low-diameter on-chip network topology for high energy efficiency and scalability. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, pp. 43--55.Google ScholarGoogle Scholar
  22. M. Besta and T. Hoefler. 2014. Fault tolerance for remote memory access programming models. In ACM HPDC. pp. 37--48.Google ScholarGoogle Scholar
  23. M. Besta and T. Hoefler. 2015. Accelerating irregular computations with hardware transactional memory and active messages. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. ACM, pp. 161--172.Google ScholarGoogle Scholar
  24. M. Besta and T. Hoefler. 2015. Active access: A mechanism for high-performance distributed data-centric computations. In ACM ICS.Google ScholarGoogle Scholar
  25. M. Besta and T. Hoefler. 2018. Survey and taxonomy of lossless graph compression and space-efficient graph representations. arXiv preprint arXiv:1806.01799.Google ScholarGoogle Scholar
  26. M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch, T. Hoefler, and E. Solomonik. 2019. Communication-efficient Jaccard similarity for high-performance distributed genome comparisons. arXiv preprint arXiv:1911.04200.Google ScholarGoogle Scholar
  27. M. Besta, F. Marending, E. Solomonik, and T. Hoefler. 2017. Slimsell: A vectorizable graph representation for breadth-first search. In Proceedings of the IEEE IPDPS, volume 17.Google ScholarGoogle Scholar
  28. M. Besta, E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, and T. Hoefler. 2019. Demystifying graph databases: Analysis and taxonomy of data organization, system designs, and graph queries. arXiv preprint arXiv:1910.09017.Google ScholarGoogle Scholar
  29. M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. 2017. To push or to pull: On reducing communication and synchronization in graph computations. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. ACM, pp. 93--104.Google ScholarGoogle Scholar
  30. M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler. 2019. Graph processing on FPGAs: Taxonomy, survey, challenges. arXiv preprint arXiv:1903.06697.Google ScholarGoogle Scholar
  31. M. Besta, D. Stanojevic, T. Zivic, J. Singh, M. Hoerold, and T. Hoefler. 2018. Log (graph): A near-optimal high-performance graph representation. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques. ACM, p. 7.Google ScholarGoogle Scholar
  32. B. Betkaoui, D. B. Thomas, W. Luk, and N. Pothersrzulj. 2011. A framework for FPGA acceleration of large graph problems: Graphlet counting case study. In FPT.Google ScholarGoogle Scholar
  33. B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk. 2012. Parallel FPGA-based all pairs shortest paths for sparse networks: A human brain connectome case study. In Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL) (Aug. 2012), pp. 99--104.Google ScholarGoogle ScholarCross RefCross Ref
  34. B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk. 2012. A reconfigurable computing approach for efficient and scalable parallel graph exploration. In Proceedings of the 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors (ASAP). IEEE, pp. 8--15.Google ScholarGoogle Scholar
  35. J. A. Bondy, U. S. R. Murty, et al. 1976. In Graph Theory with Applications, Vol. 290. Macmillan London.Google ScholarGoogle Scholar
  36. L. S. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler. 2006. Counting triangles in data streams. In PODS.Google ScholarGoogle Scholar
  37. A. Chakrabarti, G. Cormode, and A. Mcgregor. 2009. Annotations in data streams. In ICALP.Google ScholarGoogle Scholar
  38. Y.-W. Chang, J.-M. Lin, and D. Wong. 1998. Graph matching-based algorithms for FPGA segmentation design. In ICCAD.Google ScholarGoogle Scholar
  39. A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. 2015. One trillion edges: Graph processing at Facebook-scale. In VLDB.Google ScholarGoogle Scholar
  40. R. Chitnis, G. Cormode, H. Esfandiari, M. Hajiaghayi, A. McGregor, M. Monemizadeh, and S. Vorotnikova. 2016. Kernelization via sampling with applications to finding matchings and related problems in dynamic graph streams. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 1326--1344.Google ScholarGoogle Scholar
  41. Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei. 2016. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms. In DAC.Google ScholarGoogle Scholar
  42. T. H. Cormen. 2009. Introduction to Algorithms. MIT press.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. 2001. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition.Google ScholarGoogle Scholar
  44. G. Cormode, J. Dark, and C. Konrad. 2018. Independent sets in vertex-arrival streams. arXiv:1807.08331.Google ScholarGoogle Scholar
  45. G. Cormode, H. Jowhari, M. Monemizadeh, and S. Muthukrishnan. 2016. The sparse awakens: Streaming algorithms for matching size estimation in sparse graphs. arXiv preprint arXiv:1608.03118.Google ScholarGoogle Scholar
  46. M. Crouch and D. M. Stubbs. 2014. Improved streaming algorithms for weighted matching, via unweighted matching. In LIPIcs-Leibniz Inf.Google ScholarGoogle Scholar
  47. G. Dai, Y. Chi, Y. Wang, and H. Yang. 2016. FPGP: Graph Processing Framework on FPGA. In FPGA.Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang. 2017. ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture. In FPGA.Google ScholarGoogle Scholar
  49. M. Datar, A. Gionis, P. Indyk, and R. Motwani. 2002. Maintaining stream statistics over sliding windows. SIAM J. on Comp.Google ScholarGoogle Scholar
  50. J. de Fine Licht, S. Meierhans, and T. Hoefler. 2018. Transformations of high-level synthesis codes for high-performance computing. arXiv:1805.08288.Google ScholarGoogle Scholar
  51. J. Dean and S. Ghemawat. 2008. Mapreduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107--113.Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. C. Demetrescu, I. Finocchi, and A. Ribichini. 2009. Trading off space for passes in graph streaming problems. TALG.Google ScholarGoogle Scholar
  53. S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek, M. Besta, L. Benini, D. Roweth, and T. Hoefler. 2019. Network-accelerated non-contiguous memory transfers. arXiv preprint arXiv:1908.08590.Google ScholarGoogle Scholar
  54. W. J. Dixon and F. J. Massey Jr. 1951. In Introduction to Statistical Analysis. McGraw-Hill.Google ScholarGoogle Scholar
  55. R. Dorrance, F. Ren, and D. Marković. 2014. A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAS. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM. pp. 161--170.Google ScholarGoogle Scholar
  56. H. ElGindy and Y.-L. Shue. 2002. On sparse matrix-vector multiplication with FPGA-based system. In Proceedings of the10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE, pp. 273--274.Google ScholarGoogle ScholarCross RefCross Ref
  57. N. Engelhardt and H. K.-H. So. 2016. Gravf: A vertex-centric distributed graph processing framework on FPGAs. In FPL.Google ScholarGoogle Scholar
  58. N. Engelhardt and H. K.-H. So. 2016. Vertex-centric graph processing on FPGA. In FCCM.Google ScholarGoogle Scholar
  59. L. Epstein, A. Levin, J. Mestre, and D. Segev. 2011. Improved approximation guarantees for weighted matching in the semi-streaming model. SIAM Journal on Discrete Mathematics 25, 3 (2011), 1251--1265.Google ScholarGoogle ScholarCross RefCross Ref
  60. J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. 2005. On graph problems in a semi-streaming model. Theoretical CS.Google ScholarGoogle Scholar
  61. R. Gerstenberger, M. Besta, and T. Hoefler. 2014. Enabling highly-scalable remote memory access programming with MPI-3 one sided. Scientific Programming 22, 2 (2014), 75--91.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. M. Ghaffari. 2017. Space-optimal semi-streaming for (2 + ε)-approximate matching. arXiv:1701.03730.Google ScholarGoogle Scholar
  63. L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler. 2018. Communication-avoiding parallel minimum cuts and connected components. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, pp. 219--232.Google ScholarGoogle Scholar
  64. A. Goel, M. Kapralov, and S. Khanna. 2012. On the communication and streaming complexity of maximum bipartite matching. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 468--485.Google ScholarGoogle Scholar
  65. J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. 2012. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI volume 12, p. 2.Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. E. Grigorescu, M. Monemizadeh, and S. Zhou. 2016. Streaming weighted matchings: Optimal meets greedy. arXiv:1608.01487.Google ScholarGoogle Scholar
  67. T. J. Harris. 1994. A survey of PRAM simulation techniques. ACM Computing Surveys (CSUR) 26, 2 (1994), 187--206.Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. M. R. Henzinger, P. Raghavan, and S. Rajagopalan. 1998. Computing on data streams. External Mem. Alg.Google ScholarGoogle Scholar
  69. T. Hoefler and R. Belli. 2015. Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, pp. 73.Google ScholarGoogle Scholar
  70. R. Inta, D. J. Bowman, and S. M. Scott. 2012. The Chimera: An off-the-shelf CPU/GPGPU/FPGA hybrid computing platform. International Journal of Reconfigurable Computing, 2012:2, 2012.Google ScholarGoogle Scholar
  71. Intel. 2017. Intel Core i7-8700K Processor.Google ScholarGoogle Scholar
  72. Intel. 2017. Intel Xeon Processor E5-2680 v4.Google ScholarGoogle Scholar
  73. Intel. 2017. Stratix 10 GX/SX Device Overview.Google ScholarGoogle Scholar
  74. Intel Arria. 2017. Intel Arria 10 Device Overview.Google ScholarGoogle Scholar
  75. R. Jidin. 2005. Extending the Thread Programming Model across Hybrid FPGA/CPU Architectures. Information Technology and Telecommunications Center (ITTC), University of Kansas.Google ScholarGoogle Scholar
  76. M. Kapralov. 2013. Better bounds for matchings in the streaming model. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 1679--1697.Google ScholarGoogle ScholarCross RefCross Ref
  77. M. Kapralov, S. Khanna, and M. Sudan. 2014. Approximating matching size from random streams. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 734--751.Google ScholarGoogle Scholar
  78. N. Kapre. 2015. Custom FPGA-based soft-processors for sparse graph acceleration. In ASAP.Google ScholarGoogle Scholar
  79. N. Kapre, N. Mehta, D. Rizzo, I. Eslick, R. Rubin, T. E. Uribe, F. Thomas, Jr., A. DeHon, et al. 2006. Graphstep: A system architecture for sparse-graph algorithms. In Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2006 (FCCM’06). IEEE, pp. 143–151.Google ScholarGoogle Scholar
  80. C. Karande, A. Mehta, and P. Tripathi. 2001. Online bipartite matching with unknown distributions. In STOC.Google ScholarGoogle Scholar
  81. R. M. Karp, U. V. Vazirani, and V. V. Vazirani. 1990. An optimal algorithm for on-line bipartite matching. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing. ACM, pp. 352--358.Google ScholarGoogle Scholar
  82. J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, Scott McMillan, Jose Moreira, John D. Owens, Carl Yang, Marcin Zalewski, and Timothy Mattson. 2016. Mathematical foundations of the GraphBLAS. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC’16). IEEE, 1–9.Google ScholarGoogle ScholarCross RefCross Ref
  83. A. Khan. 2016. Vertex-centric graph processing: The good, the bad, and the ugly. arXiv preprint arXiv:1612.07404.Google ScholarGoogle Scholar
  84. S. Khoram, J. Zhang, M. Strange, and J. Li. 2018. Accelerating graph analytics by co-optimizing storage and access on an FPGA-hmc platform. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, pp. 239--248.Google ScholarGoogle Scholar
  85. KONECT. 2017. Konect network dataset.Google ScholarGoogle Scholar
  86. C. Konrad, F. Magniez, and C. Mathieu. 2012. Maximum matching in semi-streaming with few passes. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. pp. 231--242.Google ScholarGoogle Scholar
  87. G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler. 2019. Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication. In ACM/IEEE Supercomputing. ACM, p. 24.Google ScholarGoogle Scholar
  88. A. Kyrola, G. E. Blelloch, and C. Guestrin. 2012. Graphchi: Large-scale graph computation on just a PC. USENIX.Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. J. Lee, H. Kim, S. Yoo, K. Choi, H. P. Hofstee, G.-J. Nam, M. R. Nutter, and D. Jamsek. 2017. Extrav: Boosting graph processing near storage with a coherent accelerator. Proceedings of the VLDB Endowment 10, 12 (2017), 1706--1717.Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. G. Lei, Y. Dou, R. Li, and F. Xia. 2016. An FPGA implementation for solving the large single-source-shortest-path problem. IEEE Transactions on Circuits and Systems II: Express Briefs 63, 5 (2016), 473--477.Google ScholarGoogle Scholar
  91. J. Leskovec and A. Krevl. 2014. SNAP Datasets: Stanford large network dataset collection.Google ScholarGoogle Scholar
  92. J. d. F. Licht, G. Kwasniewski, and T. Hoefler. 2019. Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis. arXiv preprint arXiv:1912.06526.Google ScholarGoogle Scholar
  93. H. Liu and P. Singh. 2004. Conceptnet: A practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211--226.Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. 2010. Graphlab: A new framework for parallel machine learning. preprint arXiv:1006.4990.Google ScholarGoogle Scholar
  95. A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. 2007. Challenges in Parallel Graph Processing. Par. Proc. Let.Google ScholarGoogle Scholar
  96. X. Ma, D. Zhang, and D. Chiou. 2017. FPGA-accelerated transactional execution of graph workloads. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, pp. 227--236.Google ScholarGoogle Scholar
  97. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, pp. 135--146.Google ScholarGoogle Scholar
  98. A. McGregor. 2005. Finding graph matchings in data streams. In APPROX-RANDOM. Springer, Vol. 3624, pp. 170--181.Google ScholarGoogle Scholar
  99. A. McGregor and S. Vorotnikova. 2016. Planar matching in streams revisited. In LIPIcs-Leibniz International Proceedings in Informatics, volume 60. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.Google ScholarGoogle Scholar
  100. A. McGregor and S. Vorotnikova. 2018. A simple, space-efficient, streaming algorithm for matchings in low arboricity graphs. In OASIcs-OpenAccess Series in Informatics volume 61. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.Google ScholarGoogle Scholar
  101. A. McGregor, S. Vorotnikova, and H. T. Vu. 2016. Better algorithms for counting triangles in data streams. In PODS.Google ScholarGoogle Scholar
  102. F. McSherry, M. Isard, and D. G. Murray. 2015. Scalability! But at what COST? In HotOS.Google ScholarGoogle Scholar
  103. S. Muthukrishnan. 2005. Data streams: Algorithms and applications. Foundations and Trends® in Theoretical Computer Science 1, 2 (2005), 117--236.Google ScholarGoogle ScholarCross RefCross Ref
  104. M. E. Newman. 2005. A measure of betweenness centrality based on random walks. Social Networks 27, 1 (2005), 39--54.Google ScholarGoogle Scholar
  105. E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin. 2014. Graphgen: An FPGA framework for vertex-centric graph computation. In Proceedings of the 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, pp. 25--28.Google ScholarGoogle Scholar
  106. NVidia. 2017. GEFORCE GTX 1080 Ti.Google ScholarGoogle Scholar
  107. T. Oguntebi and K. Olukotun. 2016. Graphops: A dataflow library for graph analytics acceleration. In FPGA.Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. N. Oliver, R. R. Sharma, S. Chang, B. Chitlur, E. Garcia, J. Grecco, A. Grier, N. Ijih, Y. Liu, P. Marolia, et al. 2011. A reconfigurable computing system based on a cache-coherent fabric. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs. IEEE, 80–85.Google ScholarGoogle ScholarDigital LibraryDigital Library
  109. M. Owaida, D. Sidler, K. Kara, and G. Alonso. 2017. Centaur: A framework for hybrid CPU-FPGA databases. In FCCM.Google ScholarGoogle Scholar
  110. M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and O. Ozturk. 2016. Energy efficient architecture for graph analytics accelerators. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), IEEE, pp. 166--177.Google ScholarGoogle Scholar
  111. L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The Pagerank Citation Ranking: Bringing Order to the Web. Tech. Rep., Stanford InfoLab.Google ScholarGoogle Scholar
  112. C. H. Papadimitriou and K. Steiglitz. 1998. Combinatorial Optimization: Algorithms and Complexity. Courier Corporation.Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. A. Paz and G. Schwartzman. 2017. A (2+)-approximation for maximum weight matching in the semi-streaming model. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 2153--2161.Google ScholarGoogle Scholar
  114. A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers. 2008. CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures. In Proceedings of the 2008 International Conference on Field Programmable Logic and Applications. IEEE, pp. 173--178.Google ScholarGoogle Scholar
  115. A. Roy, I. Mihailovic, and W. Zwaenepoel. 2013. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, pp. 472--488.Google ScholarGoogle Scholar
  116. S. Salihoglu and J. Widom. 2014. Optimizing graph algorithms on Pregel-like systems. In VLDB.Google ScholarGoogle Scholar
  117. M. Santarini. 2011. Zynq-7000 EPP sets stage for new era of innovations. Xcell.Google ScholarGoogle Scholar
  118. T. Schank. 2007. Algorithmic aspects of triangle-based network analysis.Google ScholarGoogle Scholar
  119. P. Schmid, M. Besta, and T. Hoefler. 2016. High-performance distributed RMA locks. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. ACM, pp. 19--30.Google ScholarGoogle Scholar
  120. H. Schweizer, M. Besta, and T. Hoefler. 2015. Evaluating the cost of atomic operations on modern architectures. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, pp. 445--456.Google ScholarGoogle Scholar
  121. L. Shang, A. S. Kaviani, and K. Bathala. 2002. Dynamic power consumption in Virtex-II FPGA family. In FPGA.Google ScholarGoogle Scholar
  122. Y. Shiloach and U. Vishkin. 1980. An o (log n) Parallel Connectivity Algorithm. Technical Report, Computer Science Department, Technion.Google ScholarGoogle Scholar
  123. D. Sidler, Z. István, M. Owaida, and G. Alonso. 2017. Accelerating pattern matching queries in hybrid CPU-FPGA architectures. In Proceedings of the 2017 ACM International Conference on Management of Data ACM, pages 403--415.Google ScholarGoogle Scholar
  124. Y. Simmhan, A. Kumbhare, C. Wickramaarachchi, S. Nagarkar, S. Ravi, C. Raghavendra, and V. Prasanna. 2014. Goffish: A sub-graph centric framework for large-scale graph analytics. In EuroPar.Google ScholarGoogle Scholar
  125. E. Solomonik, M. Besta, F. Vella, and T. Hoefler. 2017. Scaling betweenness centrality using communication-efficient sparse matrix multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, p. 47.Google ScholarGoogle Scholar
  126. J. Sun, G. Peterson, and O. Storaasli. 2007. Sparse matrix-vector multiplication design on FPGAS. In Proceedings of the15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007). IEEE, pp. 349--352.Google ScholarGoogle ScholarDigital LibraryDigital Library
  127. J. Sun, N.-N. Zheng, and H.-Y. Shum. 2003. Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 7 (2003), 787--800.Google ScholarGoogle ScholarDigital LibraryDigital Library
  128. R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. 2008. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 6 (2008), 1068--1080.Google ScholarGoogle ScholarDigital LibraryDigital Library
  129. A. Tate, A. Kamil, A. Dubey, A. Größlinger, B. Chamberlain, B. Goglin, C. Edwards, C. J. Newburn, D. Padua, D. Unat, Didem Unat, Emmanuel Jeannot, James Sexton, Jesus Labarta, John Shalf, Karl , Kathryn O’Brien, Leonidas Linardakis, Maciej Besta, Marie-Christine Sawley, Mark Abraham, Mauro Bianco, Miquel Pericas, Naoya Maruyama, Paul Kelly, Peter Messmer, Robert B. Ross, Romain Cledat, Satoshi Matsuoka, Thomas Schulthess, Torsten Hoeer, and Vitus Leung. 2014. Programming abstractions for data locality. In PADAL Workshop 2014, April 28–29, Swiss National Supercomputing Center.Google ScholarGoogle ScholarCross RefCross Ref
  130. N. Trinajstić, D. J. Klein, and M. Randić. 1986. On some solved and unsolved problems of chemical graph theory. International Journal of Quantum Chemistry.Google ScholarGoogle Scholar
  131. J. Tyhach, M. Hutton, S. Atsatt, A. Rahman, B. Vest, D. Lewis, M. Langhammer, S. Shumarayev, T. Hoang, A. Chan, Dong-Myung Choi, Dan Oh, Hae-Chang Lee, Jack Chui, Ket Chiew Sia, Edwin Kok, Wei-Yee Koay, and Boon-Jin Ang. 2015. Arria 10 device architecture. In CICC.Google ScholarGoogle Scholar
  132. R. Uehara and Z.-Z. Chen. 2000. Parallel approximation algorithms for maximum weighted matching in general graphs. Information Processing Letters 76, 1–2 (2000), 13--17.Google ScholarGoogle ScholarDigital LibraryDigital Library
  133. Y. Umuroglu, D. Morrison, and M. Jahre. 2015. Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform. In FPL.Google ScholarGoogle Scholar
  134. X. Wang and S. G. Ziavras. 2007. Performance-energy tradeoffs for matrix multiplication on FPGA-based mixed-mode chip multiprocessors. In Proceedings of the 8th International Symposium on Quality Electronic Design (ISQED’07). IEEE, pp. 386--391.Google ScholarGoogle Scholar
  135. G. Weisz, E. Nurvitadhi, and J. Hoe. 2013. Graphgen for coram: Graph computation on FPGAs. In CARL.Google ScholarGoogle Scholar
  136. D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. 2014. Pregel algorithms for graph connectivity problems with performance guarantees. Proceedings of the VLDB Endowment 7, 14 (2014), 1821--1832.Google ScholarGoogle ScholarDigital LibraryDigital Library
  137. C. Yang. 2018. An efficient dispatcher for large scale graph processing on opencl-based FPGAs. arXiv preprint arXiv:1806.11509.Google ScholarGoogle Scholar
  138. P. Yao. 2018. An efficient graph accelerator with parallel data conflict management. arXiv preprint arXiv:1806.00751.Google ScholarGoogle Scholar
  139. M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, Ali Ghodsi, Joseph Gonzales, Scott Shenker, and Ion Stoica. 2016. Apache spark: A unified engine for big data processing. CACM.Google ScholarGoogle Scholar
  140. M. Zelke. 2012. Weighted matching in the semi-streaming model. Algorithmica 62, 1–2, (2012), 1--20.Google ScholarGoogle ScholarDigital LibraryDigital Library
  141. J. Zhang, S. Khoram, and J. Li. 2017. Boosting the performance of FPGA-based graph processor using hybrid memory cube: A case for breadth first search. In FPGA.
  142. J. Zhang, S. Khoram, and J. Li. 2017. Boosting the performance of FPGA-based graph processor using hybrid memory cube: A case for breadth first search. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). ACM, New York, pp. 207--216.
  143. J. Zhang and J. Li. 2018. Degree-aware hybrid graph traversal on FPGA-HMC platform. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, pp. 229--238.
  144. S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. Cambricon-x: An accelerator for sparse neural networks. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1--12.
  145. J. Zhou, S. Liu, Q. Guo, X. Zhou, T. Zhi, D. Liu, C. Wang, X. Zhou, Y. Chen, and T. Chen. 2017. Tunao: A high-performance and energy-efficient reconfigurable accelerator for graph processing. In CCGRID.
  146. S. Zhou, C. Chelmis, and V. K. Prasanna. 2015. Optimizing memory performance for FPGA implementation of pagerank. In ReConFig. pp. 1--6.
  147. S. Zhou, C. Chelmis, and V. K. Prasanna. 2016. High-throughput and energy-efficient graph processing on FPGA. In Proceedings of the 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, pp. 103--110.
  148. S. Zhou, R. Kannan, H. Zeng, and V. K. Prasanna. 2018. An FPGA framework for edge-centric graph processing. In Proceedings of the 15th ACM International Conference on Computing Frontiers. ACM, pp. 69--77.
  149. S. Zhou and V. K. Prasanna. 2017. Accelerating graph analytics on CPU-FPGA heterogeneous platform. In SBAC-PAD.
  150. J. Zhu, I. Sander, and A. Jantsch. 2009. Buffer minimization of real-time streaming applications scheduling on hybrid CPU/FPGA architectures. In Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, pp. 1506--1511.
  151. L. Zhuo and V. K. Prasanna. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM, pp. 63--74.

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 2
June 2020
185 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3383521
Editor: Deming Chen

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 24 April 2020
• Accepted: 1 December 2019
• Revised: 1 September 2019
• Received: 1 May 2019


Qualifiers

• research-article
• Research
• Refereed
