Abstract
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems provide high levels of parallelism, large aggregate memory bandwidth, and low memory access latency, making them a good fit for accelerating the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV is one of the most important and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity pattern of the input matrix. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies for SpMV on a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads, and (3) synchronization approaches, and we characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels, which perform the complete SpMV computation using only PIM cores, and (2) 2D-partitioned kernels, which strike a balance between computation and data transfer costs to PIM-enabled memory.
Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of both memory-centric PIM systems and conventional processor-centric CPU/GPU systems for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems, supporting the four most widely used compressed matrix formats (CSR, COO, BCSR, and BCOO) and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
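To make the two ideas in the abstract concrete, the sketch below shows (a) a plain CSR-based SpMV kernel, whose irregular gather accesses to the input vector are what make the kernel memory-bound, and (b) a simple nonzero-balanced 1D row partitioning across PIM cores. This is a minimal illustration, not SparseP's actual implementation; the function names and the partitioning heuristic are ours.

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    values/col_idx are streamed contiguously per row, but x is accessed
    through an irregular gather (x[col_idx[k]]), which is why SpMV is
    dominated by memory accesses rather than arithmetic.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=values.dtype)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

def balance_rows_by_nnz(row_ptr, n_cores):
    """1D partitioning sketch: split the rows into n_cores contiguous
    chunks holding roughly equal numbers of nonzeros, so that each PIM
    core performs a similar amount of work."""
    total_nnz = row_ptr[-1]
    target = total_nnz / n_cores
    bounds = [0]
    for c in range(1, n_cores):
        # first row whose cumulative nonzero count reaches c * target
        bounds.append(int(np.searchsorted(row_ptr, c * target)))
    bounds.append(len(row_ptr) - 1)
    return bounds  # core c owns rows bounds[c] .. bounds[c+1]-1
```

Balancing on nonzeros rather than rows matters because real sparse matrices often have highly skewed row lengths; splitting rows evenly would leave some cores nearly idle.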
SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures