Abstract
Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and other domains. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forgo popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams that are processed independently, enabling more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide a theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory usage, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
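The abstract does not say how the substreams are formed, so the following is only a minimal sketch of the substream-centric idea under stated assumptions. It assumes a weighted edge stream, substreams defined by geometrically growing weight thresholds, a greedy maximal matching kept independently per substream, and a final sequential merge that favors the heaviest substreams. The names `substream_matching`, `eps`, and `num_substreams` are hypothetical and chosen for illustration; this is not presented as the authors' implementation.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight); weights assumed to be >= 1


def substream_matching(edges: Iterable[Edge],
                       eps: float = 0.5,
                       num_substreams: int = 8) -> List[Edge]:
    """Sketch: one pass over the stream, one greedy matching per substream."""
    # Substream i accepts only edges with weight >= (1 + eps)^i (assumption).
    thresholds = [(1.0 + eps) ** i for i in range(num_substreams)]
    matched: List[Set[int]] = [set() for _ in thresholds]
    matchings: List[List[Edge]] = [[] for _ in thresholds]

    # Streaming phase: each substream only touches its own state, so the
    # inner loop could run in parallel (e.g., one hardware pipeline each).
    for u, v, w in edges:
        if u == v:
            continue  # ignore self-loops
        for i, t in enumerate(thresholds):
            if w < t:
                break  # thresholds are increasing, so later ones fail too
            if u not in matched[i] and v not in matched[i]:
                matched[i].update((u, v))
                matchings[i].append((u, v, w))

    # Merge phase: scan substreams from heaviest to lightest and greedily
    # keep every edge whose endpoints are still free.
    used: Set[int] = set()
    result: List[Edge] = []
    for m in reversed(matchings):
        for u, v, w in m:
            if u not in used and v not in used:
                used.update((u, v))
                result.append((u, v, w))
    return result


if __name__ == "__main__":
    stream = [(0, 1, 1.0), (1, 2, 5.0), (2, 3, 1.0)]
    print(substream_matching(stream, eps=1.0, num_substreams=3))
```

In this sketch the only shared data is the input edge itself: every substream reads the same edge but updates private state, which is the property that makes the streaming phase easy to parallelize. The merge phase is sequential, but it operates on the small per-substream matchings rather than on the full edge stream.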
Index Terms
Substream-Centric Maximum Matchings on FPGA