Abstract
Graph algorithms such as graph traversal have been gaining ever-increasing importance in the era of big data. However, graph processing on traditional architectures issues many random and irregular memory accesses, leading to a huge number of data movements and the consumption of very large amounts of energy. To minimize the waste of memory bandwidth, we investigate utilizing processing-in-memory (PIM), combined with non-volatile metal-oxide resistive random access memory (ReRAM), to improve both computation and I/O performance.
We propose a new ReRAM-based processing-in-memory architecture called RPBFS, in which graph data can be persistently stored and processed in place. We study the problem of graph traversal, and we design an efficient graph traversal algorithm in RPBFS. Benefiting from low data movement overhead and high bank-level parallel computation, RPBFS shows a significant performance improvement compared with both the CPU-based and the GPU-based BFS implementations. On a suite of real-world graphs, our architecture yields a speedup in graph traversal performance of up to 33.8×, and achieves a reduction in energy over conventional systems of up to 142.8×.
- Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 105--117. Google Scholar
Digital Library
- Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 336--348. Google Scholar
Digital Library
- Berkin Akin, Franz Franchetti, and James C. Hoe. 2015. Data reorganization in memory using 3D-stacked DRAM. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 131--143. Google Scholar
Digital Library
- Fabien Alibart, Ligang Gao, Brian D. Hoskins, and Dmitri B. Strukov. 2012. High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology 23, 7 (2012), 075201.Google Scholar
Cross Ref
- Rajeev Balasubramonian, Jichuan Chang, Troy Manning, Jaime H. Moreno, Richard Murphy, Ravi Nair, and Steven Swanson. 2014. Near-data processing: Insights from a MICRO-46 workshop. IEEE Micro 34, 4 (2014), 36--42.Google Scholar
- Scott Beamer, Krste Asanović, and David Patterson. 2013. Direction-optimizing breadth-first search. Scientific Programming 21, 3--4 (2013), 137--148. Google Scholar
Digital Library
- Scott Beamer, Krste Asanovic, and David Patterson. 2015. Locality exists in graph processing: Workload characterization on an ivy bridge server. In Proceedings of the 2015 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 56--65. Google Scholar
Digital Library
- Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. 2015. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the 10th European Conference on Computer Systems. ACM, 1. Google Scholar
Digital Library
- Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE, 27--39. Google Scholar
Digital Library
- Bram Cohen. 2003. Incentives build robustness in bittorrent. In Proceedings of the Workshop on Economics of Peer-to-Peer Systems, Vol. 6. 68--72.Google Scholar
- Thomas H. Cormen. 2009. Introduction to Algorithms. MIT Press. Google Scholar
Digital Library
- Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (1998), 46--55. Google Scholar
Digital Library
- Xiangyu Dong, Cong Xu, Norm Jouppi, and Yuan Xie. 2014. NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory. In Emerging Memory Technologies. Springer, 15--50.Google Scholar
- Lei Han, Zhaoyan Shen, Zili Shao, H. Howie Huang, and Tao Li. 2017. A novel ReRAM-based processing-in-memory architecture for graph computing. In Proceedings of 2017 IEEE 6th Non-Volatile Memory Systems and Applications Symposium (NVMSA). IEEE, 1--6.Google Scholar
Cross Ref
- Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. 2011. Efficient parallel graph exploration on multi-core CPU and GPU. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 78--88. Google Scholar
Digital Library
- George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359--392. Google Scholar
Digital Library
- Jure Leskovec and Andrej Krevl. 2015. {SNAP Datasets}:{Stanford} Large Network Dataset Collection.Google Scholar
- Jing Li, Chao-I Wu, Scott C. Lewis, Jackie Morrish, Tien-Yen Wang, Richard Jordan, Tom Maffitt, Matthew Breitwisch, Alejandro Schrott, Roger Cheek, and others. 2011. A novel reconfigurable sensing scheme for variable level storage in phase change memory. In Proceedings of the 2011 3rd IEEE International Memory Workshop (IMW). IEEE, 1--4.Google Scholar
Cross Ref
- Duo Liu, Tianzheng Wang, Yi Wang, Zili Shao, Qingfeng Zhuge, and Edwin H.-M. Sha. 2014. Application-specific wear leveling for extending lifetime of phase change memory in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 33, 10 (2014), 1450--1462.Google Scholar
Cross Ref
- Hang Liu and H Howie Huang. 2015. Enterprise: Breadth-first graph traversal on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 68. Google Scholar
Digital Library
- Hang Liu, H. Howie Huang, and Yang Hu. 2016. iBFS: Concurrent breadth-first search on GPUs. In Proceedings of the 2016 International Conference on Management of Data. ACM, 403--416. Google Scholar
Digital Library
- Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Hai Li, Yiran Chen, Boxun Li, Yu Wang, Hao Jiang, Mark Barnell, Qing Wu, and others. 2015. RENO: A high-efficient reconfigurable neuromorphic computing accelerator design. In Proceedings of the 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1--6. Google Scholar
Digital Library
- Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ACM, 273--282. Google Scholar
Digital Library
- Martin Dimitrov and Carl Strickland. 2016. Intel power gadget. Intel Corporation 7 (2016). https://software.intel.com/en-us/articles/intel-power-gadget-20.Google Scholar
- Duane Merrill, Michael Garland, and Andrew Grimshaw. 2012. Scalable GPU graph traversal. In ACM SIGPLAN Notices, Vol. 47. ACM, 117--128. Google Scholar
Digital Library
- Nooshin Mirzadeh, Yusuf Onur Koçberber, Babak Falsafi, and Boris Grot. 2015. Sort vs. Hash join revisited for near-memory execution. In Proceedings of the 5th Workshop on Architectures and Systems for Big Data (ASBD’15).Google Scholar
- Dimin Niu, Cong Xu, Naveen Muralimanohar, Norman P. Jouppi, and Yuan Xie. 2013. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In Proceedings of the 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 17--23. Google Scholar
Digital Library
- Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C. Hoe, José F. Martínez, and Carlos Guestrin. 2014. Graphgen: An FPGA framework for vertex-centric graph computation. In Proceedings of the 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 25--28. Google Scholar
Digital Library
- NVIDIA. 2017. CUDA Toolkit Documentation. Technical Report. http://docs.nvidia.com/cuda/profiler-users-guide/index.htmlnvprof-overview.Google Scholar
- Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy efficient architecture for graph analytics accelerators. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 166--177. Google Scholar
Digital Library
- Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.Google Scholar
- Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory+ logic devices on MapReduce workloads. In Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 190--200.Google Scholar
Cross Ref
- Zhiwei Qin, Yi Wang, Duo Liu, Zili Shao, and Yong Guan. 2011. MNFTL: An efficient flash translation layer for MLC NAND flash memory storage systems. In Proceedings of the 48th Design Automation Conference. ACM, 17--22. Google Scholar
Digital Library
- Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 472--488. Google Scholar
Digital Library
- Semih Salihoglu and Jennifer Widom. 2013. GPS: A graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management. ACM, 22. Google Scholar
Digital Library
- Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and others. 2013. RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization. In Proceedings of the 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 185--197. Google Scholar
Digital Library
- Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE, 14--26. Google Scholar
Digital Library
- Hojun Shim, Yongsoo Joo, Yongseok Choi, Hyung Gyu Lee, and Naehyuck Chang. 2003. Low-energy off-chip SDRAM memory systems for embedded applications. ACM Transactions on Embedded Computing Systems (TECS) 2, 1 (2003), 98--130. Google Scholar
Digital Library
- Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. PipeLayer: A pipelined ReRAM-based accelerator for deep learning. In Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 541--552.Google Scholar
Cross Ref
- Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2017. GraphR: Accelerating graph processing using ReRAM. Arxiv:1708.06248.Google Scholar
- Yuliang Sun, Yu Wang, and Huazhong Yang. 2017. Energy-efficient SQL query exploiting RRAM-based process-in-memory structure. In Proceedings of the 2017 IEEE 6th Non-Volatile Memory Systems and Applications Symposium (NVMSA). IEEE, 1--6.Google Scholar
Cross Ref
- Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2016. Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 11. Google Scholar
Digital Library
- Yi Wang, Zhiwei Qin, Renhai Chen, Zili Shao, Qixin Wang, Shuai Li, and Laurence T. Yang. 2016. A real-time flash translation layer for NAND flash memory storage systems. IEEE Transactions on Multi-Scale Computing Systems 2, 1 (2016), 17--29. Google Scholar
Digital Library
- H-S. Philip Wong, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu, Pang-Shiu Chen, Byoungil Lee, Frederick T. Chen, and Ming-Jinn Tsai. 2012. Metal--oxide RRAM. Proceedings of the IEEE 100, 6 (2012), 1951--1970.Google Scholar
Cross Ref
- Cong Xu, Pai-Yu Chen, Dimin Niu, Yang Zheng, Shimeng Yu, and Yuan Xie. 2014. Architecting 3D vertical resistive memory for next-generation storage systems. In Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design. IEEE, 55--62. Google Scholar
Digital Library
- Cong Xu, Xiangyu Dong, Norman P. Jouppi, and Yuan Xie. 2011. Design implications of memristor-based RRAM cross-point structures. In Proceedings of the Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE). IEEE, 1--6.Google Scholar
- Cong Xu, Dimin Niu, Naveen Muralimanohar, Rajeev Balasubramonian, Tao Zhang, Shimeng Yu, and Yuan Xie. 2015. Overcoming the challenges of crossbar resistive memory architectures. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 476--488.Google Scholar
Cross Ref
- Cong Xu, Dimin Niu, Naveen Muralimanohar, Norman P. Jouppi, and Yuan Xie. 2013. Understanding the trade-offs in multi-level cell ReRAM memory design. In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1--6. Google Scholar
Digital Library
- Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. 2014. TOP-PIM: Throughput-oriented programmable processing in memory. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. ACM, 85--98. Google Scholar
Digital Library
- Hang Zhang, Nong Xiao, Fang Liu, and Zhiguang Chen. 2016. Leader: Accelerating ReRAM-based main memory by leveraging access latency discrepancy in crossbar arrays. In Proceedings of Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE). IEEE, 756--761. Google Scholar
Digital Library
- Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the performance of FPGA-based graph processor using hybrid memory cube: A case for breadth first search. In Proceedings of FPGA. 207--216. Google Scholar
Digital Library
- Jianlong Zhong and Bingsheng He. 2014. Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems 25, 6 (2014), 1543--1552. Google Scholar
Digital Library
- Qiuling Zhu, Berkin Akin, H. Ekin Sumbul, Fazle Sadi, James C. Hoe, Larry Pileggi, and Franz Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In Proceedings of the 2013 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 1--7.Google Scholar
Cross Ref
- Qiuling Zhu, Tobias Graf, H. Ekin Sumbul, Larry Pileggi, and Franz Franchetti. 2013. Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware. In Proceedings of the 2013 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1--6.Google Scholar
Cross Ref
Index Terms
A Novel ReRAM-Based Processing-in-Memory Architecture for Graph Traversal
Recommendations
A Novel ReRAM-based Main Memory Structure for Optimizing Access Latency and Reliability
DAC '17: Proceedings of the 54th Annual Design Automation Conference 2017Emerging Resistive Memory (ReRAM) is a promising candidate as the replacement for DRAM because of its low power consumption, high density and high endurance. Due to the unique crossbar structure, ReRAM can be constructed with a very high density. ...
Memristor-Based (ReRAM) Data Memory Architecture in ASIP Design
DSD '13: Proceedings of the 2013 Euromicro Conference on Digital System DesignRecently, multiple non-volatile emerging memories (NVMs) have been proposed and show promising properties to replace SRAM-based memories in future SoCs. However, these new emerging memories, such as STT-MRAM and ReRAM, provide new challenges for the ...
A Novel Resistive Memory-based Process-in-memory Architecture for Efficient Logic and Add Operations
The coming era of big data revives the Processing-in-memory (PIM) architecture to relieve the memory wall problem that embarrasses the modern computing system. However, most existing PIM designs just put computing units closer to memory, rather than a ...






Comments