research-article

Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware Acceleration

Published: 20 April 2023

Abstract

The design of heterogeneous systems that include domain-specific accelerators is a challenging and time-consuming process. While taking area constraints into account, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, applications in domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop-level, task-level, and pipeline parallelism. To assist the design process and expose every possible level of parallelism, we present Trireme, a fully automated tool-chain that explores multiple levels of parallelism and produces domain-specific accelerator designs and configurations that maximize performance, given an area budget. FPGA SoCs were used as target platforms, and Catapult HLS [7] was used to synthesize RTL in a commercial 12 nm FinFET technology. Experiments on demanding benchmarks from the XR domain revealed a speedup of up to 20×, as well as a speedup of up to 37× for smaller applications, compared to software-only implementations.
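The phrase "maximize performance, given an area budget" can be read as a constrained selection problem. The sketch below is only an illustration of that idea and not Trireme's actual algorithm, which the abstract does not describe: it treats each accelerator candidate as an independent item and solves a 0/1 knapsack by dynamic programming. The `Candidate` type, the candidate names, and every area and cycle figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical accelerator candidate (names and numbers are illustrative)."""
    name: str
    area: int          # area cost in arbitrary integer units
    cycles_saved: int  # software cycles removed if this candidate is built


def select_under_budget(candidates, area_budget):
    """Pick a subset maximizing total cycles saved within the area budget.

    Classic 0/1 knapsack by dynamic programming over integer area units.
    Assumes candidates are independent (no overlap, no shared resources).
    """
    # best[a] = (cycles saved, chosen names) using at most `a` area units
    best = [(0, ())] * (area_budget + 1)
    for cand in candidates:
        # Iterate area downwards so each candidate is used at most once.
        for a in range(area_budget, cand.area - 1, -1):
            saved, chosen = best[a - cand.area]
            if saved + cand.cycles_saved > best[a][0]:
                best[a] = (saved + cand.cycles_saved, chosen + (cand.name,))
    return best[area_budget]


def speedup(total_sw_cycles, cycles_saved):
    """Speedup over the software-only run, given the cycles removed."""
    return total_sw_cycles / (total_sw_cycles - cycles_saved)


# Illustrative inputs only; none of these values come from the paper.
candidates = [
    Candidate("loop_unrolled", area=4, cycles_saved=500),
    Candidate("task_parallel", area=3, cycles_saved=400),
    Candidate("pipelined", area=5, cycles_saved=650),
]
saved, chosen = select_under_budget(candidates, area_budget=8)
```

With these made-up numbers, the selection skips the individually attractive `loop_unrolled` candidate because pairing the other two fills the budget more profitably. A real exploration would also have to model interactions between parallelism levels, which this independent-item sketch deliberately ignores.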

REFERENCES

[1] Binkert Nathan, Beckmann Bradford, Black Gabriel, Reinhardt Steven K., Saidi Ali, Basu Arkaprava, Hestness Joel, Hower Derek R., Krishna Tushar, Sardashti Somayeh, et al. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2 (Feb. 2011), 1–7.
[2] Bron Coen and Kerbosch Joep. 1973. Algorithm 457: Finding all cliques of an undirected graph. In Communications ACM, Vol. 9. 575–577.
[3] Brumar Iulian, Zacharopoulos Georgios, Yao Yuan, Rama Saketh, Wei Gu-Yeon, and Brooks David. 2022. Early DSE and automatic generation of coarse grained merged accelerators. ACM Trans. Embed. Comput. Syst. (June 2022).
[4] Cadence. 2016. Stratus High-Level Synthesis. Retrieved from https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/synthesis/stratus-high-level-synthesis.html.
[5] Campanoni Simone, Brownell Kevin, Kanev Svilen, Jones Timothy M., Wei Gu-Yeon, and Brooks David. 2014. HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. In Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). IEEE, 217–228.
[6] Canis Andrew, Choi Jongsok, Fort Blair, Lian Ruolong, Huang Qijing, Calagar Nazanin, Gort Marcel, Qin Jia Jun, Aldham Mark, Czajkowski Tomasz, et al. 2013. From software to accelerators with LegUp high-level synthesis. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. IEEE.
[7] Catapult. 2017. Catapult High-level Synthesis. Retrieved from https://eda.sw.siemens.com/en-US/ic/ic-design/high-level-synthesis-and-verification-platform/.
[8] Durst David, Feldman Matthew, Huff Dillon, Akeley David, Daly Ross, Bernstein Gilbert Louis, Patrignani Marco, Fatahalian Kayvon, and Hanrahan Pat. 2020. Type-directed scheduling of streaming accelerators. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, 408–422.
[9] Esmaeilzadeh Hadi, Blem Emily, Amant Renee St., Sankaralingam Karthikeyan, and Burger Doug. 2011. Dark silicon and the end of multicore scaling. In ACM SIGARCH Computer Architecture News, Vol. 39. 365–376.
[10] Ferretti Lorenzo, Cini Andrea, Zacharopoulos Georgios, Alippi Cesare, and Pozzi Laura. 2021. A graph deep learning framework for high-level synthesis design space exploration. arXiv preprint arXiv:2111.14767 (2021).
[11] Ferretti Lorenzo, Cini Andrea, Zacharopoulos Georgios, Alippi Cesare, and Pozzi Laura. 2022. Graph neural networks for high-level synthesis design space exploration. ACM Trans. Des. Automat. Electron. Syst. 28, 2 (2022), 20.
[12] Huzaifa Muhammad, Desai Rishi, Grayson Samuel, Jiang Xutao, Jing Ying, Lee Jae, Lu Fang, Pang Yihan, Ravichandran Joseph, Sinclair Finn, Tian Boyuan, Yuan Hengzhi, Zhang Jeffrey, and Adve Sarita V. 2021. ILLIXR: Enabling end-to-end extended reality research. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). 24–38.
[13] Koeplinger David, Feldman Matthew, Prabhakar Raghu, Zhang Yaqi, Hadjis Stefan, Fiszel Ruben, Zhao Tian, Nardi Luigi, Pedram Ardavan, Kozyrakis Christos, et al. 2018. Spatial: A language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 296–311.
[14] Kotsifakou Maria, Srivastava Prakalp, Sinclair Matthew D., Komuravelli Rakesh, Adve Vikram, and Adve Sarita. 2018. HPVM: Heterogeneous parallel virtual machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 68–80.
[15] Kumar Snehasish, Srinivasan Vijayalakshmi, Sharifian Amirali, Sumner Nick, and Shriraman Arrvindh. 2016. Peruse and profit: Estimating the accelerability of loops. In Proceedings of the International Conference on Supercomputing. 1–13.
[16] Lai Yi-Hsiang, Chi Yuze, Hu Yuwei, Wang Jie, Yu Cody Hao, Zhou Yuan, Cong Jason, and Zhang Zhiru. 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 242–251.
[17] Lattner Chris and Adve Vikram. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2nd International Symposium on Code Generation and Optimization. 75–88.
[18] LLVM Project. Circuit IR Compilers and Tools (CIRCT). https://github.com/llvm/circt.
[19] Margerm Steven, Sharifian Amirali, Guha Apala, Shriraman Arrvindh, and Pokam Gilles. 2018. TAPAS: Generating parallel accelerators from parallel programs. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 245–257.
[20] Meeus Wim, Beeck Kristof Van, Goedemé Toon, Meel Jan, and Stroobandt Dirk. 2012. An overview of today’s high-level synthesis tools. Des. Automat. Embed. Syst. 16, 3 (Sept. 2012), 31–51.
[21] Nardi Luigi, Souza Artur, Koeplinger David, and Olukotun Kunle. 2019. HyperMapper: A practical design space exploration framework. In Proceedings of the IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 425–426.
[22] Nguyen Tan, Gurumani Swathi, Rupnow Kyle, and Chen Deming. 2016. FCUDA-SoC: Platform integration for field-programmable SoC with the CUDA-to-FPGA compiler. In Proceedings of the ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 5–14.
[23] Papakonstantinou Alexandros, Gururaj Karthik, Stratton John A., Chen Deming, Cong Jason, and Hwu Wen-Mei W. 2009. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the IEEE 7th Symposium on Application Specific Processors. IEEE, 35–42.
[24] Pilato Christian and Ferrandi Fabrizio. 2012. Bambu: A free framework for the high level synthesis of complex applications. In Proceedings of the 23rd International Conference on Field Programmable Logic and Applications.
[25] Reagen Brandon, Adolf Robert, Shao Yakun Sophia, Wei Gu-Yeon, and Brooks David. 2014. MachSuite: Benchmarks for accelerator design and customized architectures. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 110–119.
[26] Rogers Samuel, Slycord Joshua, Baharani Mohammadreza, and Tabkhi Hamed. 2020. gem5-SALAM: A system architecture for LLVM-based accelerator modeling. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 471–482.
[27] Schardl Tao B., Moses William S., and Leiserson Charles E. 2017. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 249–265.
[28] Shao Yakun Sophia, Reagen Brandon, Wei Gu-Yeon, and Brooks David. 2014. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Proceedings of the 41st Annual International Symposium on Computer Architecture. IEEE, 97–108.
[29] Shao Yakun Sophia, Xi Sam Likun, Srinivasan Vijayalakshmi, Wei Gu-Yeon, and Brooks David. 2016. Co-designing accelerators and SoC interfaces using gem5-aladdin. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–12.
[30] Simonite Tom. 2016. Moore’s law is dead. Now what? MIT Technol. Rev. May 13 (2016), 40–41.
[31] Stratton John A., Rodrigues Christopher, Sung I-Jui, Obeid Nady, Chang Li-Wen, Anssari Nasser, Liu Geng Daniel, and Hwu Wen-mei W. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Cent. Reliab. High-perform. Comput. 127 (2012).
[32] Xilinx. 2017. Vivado High-level Synthesis. Retrieved from www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[33] Xilinx. 2017. Xilinx All Programmable SoC portfolio. Retrieved from www.xilinx.com/products/silicon-devices/soc.html.
[34] Yao Yuan and Rama Saketh. yaoyuannnn. CAVA: Camera Vision Pipeline on gem5-Aladdin. https://github.com/yaoyuannnn/cava.
[35] Zacharopoulos Georgios, Barbon Andrea, Ansaloni Giovanni, and Pozzi Laura. 2018. Machine learning approach for loop unrolling factor prediction in high level synthesis. In Proceedings of the IEEE International Conference on High Performance Computing & Simulation (HPCS). 91–97.
[36] Zacharopoulos Georgios, Ferretti Lorenzo, Ansaloni Giovanni, Guglielmo Giuseppe Di, Carloni Luca, and Pozzi Laura. 2019. Compiler-assisted selection of hardware acceleration candidates from application source code. In Proceedings of the International Conference on Computer Design. 1–9.
[37] Zacharopoulos Georgios, Ferretti Lorenzo, Giaquinta Emanuele, Ansaloni Giovanni, and Pozzi Laura. 2019. RegionSeeker: Automatically identifying and selecting accelerators from application source code. IEEE Trans. Comput.-aid Des. Integ. Circ. Syst. 38, 4 (Apr. 2019), 741–754.
[38] Zacharopoulos Georgios and Pozzi Laura. 2017. ClrFreqCFGPrinter: A Tool for Frequency Annotated Control Flow Graph Generation. Technical Report. European LLVM Developers Meeting.
[39] Zhou Ruoyu and Jones Timothy M. 2019. Janus: Statically-driven and profile-guided automatic dynamic binary parallelisation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 15–25.


Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3
May 2023
546 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3592782
Editor: Tulika Mitra

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 20 April 2023
• Online AM: 17 January 2023
• Accepted: 9 January 2023
• Revised: 11 November 2022
• Received: 6 October 2021
