SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures

Published: 28 February 2022

Abstract

Several manufacturers have recently started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can alleviate data access costs and thereby yield significant performance and energy improvements in parallel applications. Real PIM systems provide high levels of parallelism, large aggregate memory bandwidth, and low memory access latency, making them a good fit for accelerating the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV is one of the most important and thoroughly studied scientific computation kernels. It is primarily memory-bound, with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity pattern of the input matrix. This paper provides the first comprehensive analysis of SpMV on a real PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies for SpMV on a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads, and (3) synchronization approaches, and we characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, along with two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels that perform the complete SpMV computation using only PIM cores, and (2) 2D-partitioned kernels that strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of both memory-centric PIM systems and conventional processor-centric CPU/GPU systems on the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems, supporting the four most widely used compressed matrix formats (CSR, COO, BCSR, and BCOO) and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate SpMV on real PIM systems.
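To make the ideas in the abstract concrete, the following is a minimal illustrative sketch (not SparseP's actual implementation, and all names here are hypothetical): an SpMV kernel over the CSR format, plus a simple 1D row partitioner that assigns contiguous row ranges to cores so each gets roughly equal nonzeros, in the spirit of the load balancing schemes the paper studies.

```python
# Illustrative sketch only: CSR SpMV and nonzero-balanced 1D row partitioning.
# This is NOT SparseP code; function names and structure are assumptions.

def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x, where A is stored in CSR:
    row_ptr[i]..row_ptr[i+1] index the nonzeros of row i."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def balance_rows(row_ptr, n_cores):
    """Split rows into n_cores contiguous ranges with ~equal nonzero counts
    (a simple 1D load-balancing scheme). Returns range boundaries."""
    total_nnz = row_ptr[-1]
    bounds = [0]
    for c in range(1, n_cores):
        goal = c * total_nnz / n_cores
        i = bounds[-1]
        # advance until this core's cumulative nonzero share is reached
        while i < len(row_ptr) - 1 and row_ptr[i + 1] < goal:
            i += 1
        bounds.append(i)
    bounds.append(len(row_ptr) - 1)
    return bounds

# Example: A = [[1,0,2],[0,3,0],[4,0,5]] in CSR, x = [1,1,1]
row_ptr, col_idx, vals = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5]
y = csr_spmv(row_ptr, col_idx, vals, [1, 1, 1])   # [3.0, 3.0, 9.0]
parts = balance_rows(row_ptr, 2)                   # rows 0-1 vs rows 1-3
```

In a real near-bank PIM setting, each row range (and the corresponding slices of `col_idx`/`vals`) would be transferred to a different PIM core's local DRAM bank, the cores would compute their output slices in parallel, and the host would gather the partial results.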

