
MEG: A RISCV-based System Emulation Infrastructure for Near-data Processing Using FPGAs and High-bandwidth Memory

Published: 30 September 2020

Abstract

Emerging three-dimensional (3D) memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide high bandwidth and massive memory-level parallelism. With the growing heterogeneity and complexity of computer systems (CPU cores, accelerators, etc.), efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a realistic computing environment. In this article, we propose MEG, an open-source, configurable, cycle-exact, RISC-V-based full-system emulation infrastructure using FPGAs and HBM. MEG provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full-system environment. To improve the observability and debuggability of the system, MEG also provides a flexible performance monitoring scheme to guide performance optimization. The proposed MEG infrastructure can potentially benefit broad communities across computer architecture, system software, and application software. Leveraging MEG, we present two cross-layer system optimizations as illustrative case studies to demonstrate its usability. In the first case study, we present a reconfigurable memory controller that improves on the address mapping of a standard memory controller. This reconfigurable memory controller, along with its OS support, allows us to optimize the address mapping scheme to fully exploit the massive parallelism provided by emerging 3D memories. In the second case study, we present a lightweight IOMMU design to tackle the unique challenges that 3D memory poses in providing virtual memory support for near-memory accelerators. We provide a prototype implementation of MEG on a Xilinx VU37P FPGA and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope MEG fills a gap in the space of publicly available FPGA-based full-system emulation infrastructures, specifically those targeting memory systems, and inspires further collaborative software/hardware innovations.

