research-article

A Novel ReRAM-Based Processing-in-Memory Architecture for Graph Traversal

Published: 26 February 2018

Abstract

Graph algorithms such as graph traversal have become increasingly important in the era of big data. However, graph processing on traditional architectures issues many random and irregular memory accesses, which cause heavy data movement and high energy consumption. To reduce the waste of memory bandwidth, we investigate processing-in-memory (PIM), combined with non-volatile metal-oxide resistive random access memory (ReRAM), to improve both computation and I/O performance.

We propose a new ReRAM-based processing-in-memory architecture called RPBFS, in which graph data can be persistently stored and processed in place. We study the problem of graph traversal and design an efficient graph traversal algorithm for RPBFS. Benefiting from low data movement overhead and high bank-level parallel computation, RPBFS shows a significant performance improvement over both CPU-based and GPU-based BFS implementations. On a suite of real-world graphs, our architecture speeds up graph traversal by up to 33.8× and reduces energy consumption by up to 142.8× compared with conventional systems.
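To make the idea of bank-level parallel traversal concrete, the following is a minimal software sketch of level-synchronous BFS in which each vertex's adjacency list is owned by one "bank". It is not the RPBFS algorithm or hardware design, which this abstract does not specify. The modulo placement rule, the function names `partition_by_bank` and `bfs_levels`, and the `num_banks` parameter are all assumptions made for illustration, and the per-bank loops run sequentially here, whereas real banks would work concurrently.

```python
from collections import defaultdict


def partition_by_bank(adjacency, num_banks):
    """Place each vertex's adjacency list in a bank (assumed rule: id modulo num_banks)."""
    banks = [dict() for _ in range(num_banks)]
    for vertex, neighbors in adjacency.items():
        banks[vertex % num_banks][vertex] = neighbors
    return banks


def bfs_levels(adjacency, source, num_banks=4):
    """Level-synchronous BFS; returns a dict mapping each reachable vertex to its depth."""
    banks = partition_by_bank(adjacency, num_banks)
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        # Group the current frontier by the bank that stores each vertex.
        per_bank = defaultdict(list)
        for vertex in frontier:
            per_bank[vertex % num_banks].append(vertex)
        next_frontier = []
        # Each bank expands only the vertices it stores. In this sketch the
        # banks are visited one after another; in hardware they could run in parallel.
        for bank_id, vertices in per_bank.items():
            for vertex in vertices:
                for neighbor in banks[bank_id].get(vertex, ()):
                    if neighbor not in level:
                        level[neighbor] = depth
                        next_frontier.append(neighbor)
        frontier = next_frontier
    return level


# Small hypothetical graph, given as vertex -> list of out-neighbors.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = bfs_levels(graph, source=0)
```

Because every adjacency list is read only inside the bank that holds it, the only data that crosses bank boundaries in this sketch is the set of newly discovered vertex ids, which is the kind of data movement reduction the abstract attributes to in-place processing.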

Published in

ACM Transactions on Storage, Volume 14, Issue 1
Special Issue on NVM and Storage
February 2018
237 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3190860
Editor: Sam H. Noh

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 26 February 2018
• Accepted: 1 January 2018
• Received: 1 November 2017

Qualifiers

• research-article
• Research
• Refereed