Abstract
Emerging three-dimensional (3D) memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide high bandwidth and massive memory-level parallelism. With the growing heterogeneity and complexity of computer systems (CPU cores, accelerators, etc.), efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a realistic computing environment. In this article, we propose MEG, an open-source, configurable, cycle-exact, RISC-V-based full-system emulation infrastructure using FPGA and HBM. MEG provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full-system environment. To improve the observability and debuggability of the system, MEG also provides a flexible performance monitoring scheme to guide performance optimization. The proposed MEG infrastructure can potentially benefit broad communities across computer architecture, system software, and application software. Leveraging MEG, we present two cross-layer system optimizations as illustrative case studies to demonstrate its usability. In the first case study, we present a reconfigurable memory controller that improves on the address mapping of a standard memory controller. This reconfigurable memory controller, along with its OS support, allows us to optimize the address mapping scheme to fully exploit the massive parallelism provided by emerging 3D memories. In the second case study, we present a lightweight IOMMU design that tackles the unique challenges 3D memory poses for providing virtual memory support to near-memory accelerators. We provide a prototype implementation of MEG on a Xilinx VU37P FPGA and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications.
We hope MEG fills a gap in the space of publicly available FPGA-based full-system emulation infrastructures, specifically targeting memory systems, and inspires further collaborative software/hardware innovations.
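To make the first case study more concrete, the following is a minimal sketch of why the address mapping scheme matters for memory-level parallelism. It is not MEG's memory controller: the field widths, the 64-byte line size, and the names `FIELD_BITS`, `decode`, `channels_touched`, `COLUMN_FIRST`, and `CHANNEL_FIRST` are all assumptions chosen for illustration. The sketch decodes a physical address into DRAM fields under a configurable bit order and counts how many channels a sequential stream touches under two candidate mappings.

```python
from collections import Counter

# Hypothetical field widths in bits; not taken from MEG or any HBM datasheet.
FIELD_BITS = {"column": 5, "channel": 3, "bank": 4, "row": 14}
LINE_OFFSET_BITS = 6  # assumption: 64-byte access granularity


def decode(addr, order):
    """Split a physical address into DRAM fields.

    `order` lists field names from least- to most-significant bit.
    """
    bits = addr >> LINE_OFFSET_BITS
    fields = {}
    for name in order:
        width = FIELD_BITS[name]
        fields[name] = bits & ((1 << width) - 1)
        bits >>= width
    return fields


def channels_touched(addrs, order):
    """Count how many accesses land on each channel under a mapping."""
    return Counter(decode(a, order)["channel"] for a in addrs)


# Two candidate mappings (least-significant field first).
COLUMN_FIRST = ["column", "channel", "bank", "row"]
CHANNEL_FIRST = ["channel", "column", "bank", "row"]

# A sequential stream of 64 consecutive 64-byte lines.
stream = [i * 64 for i in range(64)]

col_first = channels_touched(stream, COLUMN_FIRST)
chan_first = channels_touched(stream, CHANNEL_FIRST)

print("column-first channels used: ", len(col_first))
print("channel-first channels used:", len(chan_first))
```

Under these assumed widths, the channel-first ordering spreads the same stream across more channels than the column-first ordering, which is the kind of effect a reconfigurable mapping lets the OS and hardware tune per workload.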
MEG: A RISCV-based System Emulation Infrastructure for Near-data Processing Using FPGAs and High-bandwidth Memory