research-article

Evaluating Controlled Memory Request Injection for Efficient Bandwidth Utilization and Predictable Execution in Heterogeneous SoCs

Published: 13 December 2022

Abstract

High-performance embedded platforms are increasingly adopting heterogeneous systems-on-chip (HeSoCs) that couple multi-core CPUs with accelerators such as GPUs, FPGAs, or AI engines. Adopting HeSoCs for real-time workloads is not straightforward, though, because contention on shared resources such as the memory hierarchy, and in particular the main memory (DRAM), causes unpredictable latency increases. To tackle this problem, both the research community and certification authorities mandate either (i) that accesses from parallel threads to shared system resources (typically main memory) happen in a mutually exclusive manner by design, or (ii) that per-thread bandwidth regulation is enforced. Such arbitration schemes provide timing guarantees but make poor use of the memory bandwidth available in a modern HeSoC. Controlled Memory Request Injection (CMRI) is a recently proposed bandwidth limitation concept that builds on a mutually exclusive schedule but still allows the threads not currently entitled to access memory to use as much of the unused bandwidth as possible without losing the timing guarantee. CMRI has been discussed in the context of multi-core CPUs, but the same principle also applies to a more complex system such as an HeSoC. In this article, we introduce two CMRI schemes suitable for HeSoCs: Voluntary Throttling via code refactoring and Bandwidth Regulation via dynamic throttling. We extensively characterize a proof-of-concept implementation of both schemes on two HeSoCs, an NVIDIA Tegra TX2 and a Xilinx UltraScale+, highlighting the benefits and costs of CMRI for synthetic workloads that model worst-case DRAM access. We also test the effectiveness of CMRI with real benchmarks, studying the effect of interference between the host CPU and the accelerators.
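The idea of layering a bandwidth budget on top of a mutually exclusive schedule can be illustrated with a small simulation. The sketch below is not the authors' implementation: the round-robin slot assignment, the per-slot request budget, and every name in it (`Regulator`, `simulate`, `budget`) are assumptions made for illustration only. The abstract does not describe how either scheme is realized on the Tegra TX2 or the UltraScale+.

```python
from dataclasses import dataclass


@dataclass
class Regulator:
    """Hypothetical per-thread regulator: caps requests issued outside the thread's own slot."""
    budget: int      # requests a non-entitled thread may inject per slot (assumed parameter)
    used: int = 0    # requests injected so far in the current slot

    def new_slot(self) -> None:
        # The budget is replenished at every slot boundary.
        self.used = 0

    def try_inject(self, entitled: bool) -> bool:
        # The entitled thread accesses memory without restriction.
        if entitled:
            return True
        # Non-entitled threads may inject only while budget remains.
        if self.used < self.budget:
            self.used += 1
            return True
        return False


def simulate(num_threads: int, slots: int, demand: int, budget: int) -> list[int]:
    """Count served requests per thread under an assumed round-robin exclusive schedule.

    In slot s, thread (s % num_threads) is entitled; every thread tries to
    issue `demand` requests in each slot.
    """
    regulators = [Regulator(budget) for _ in range(num_threads)]
    served = [0] * num_threads
    for s in range(slots):
        owner = s % num_threads
        for reg in regulators:
            reg.new_slot()
        for t in range(num_threads):
            for _ in range(demand):
                if regulators[t].try_inject(entitled=(t == owner)):
                    served[t] += 1
    return served


if __name__ == "__main__":
    print(simulate(num_threads=2, slots=4, demand=10, budget=0))  # strict mutual exclusion
    print(simulate(num_threads=2, slots=4, demand=10, budget=3))  # controlled injection
```

With `budget=0` the model degenerates to strict mutual exclusion, so comparing the two calls shows how much extra traffic the non-entitled threads get to issue. The model says nothing about timing: whether a given budget preserves the guarantee depends on the platform and has to be established by measurement or analysis.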

          Published in

          ACM Transactions on Embedded Computing Systems, Volume 22, Issue 1
          January 2023
          512 pages
          ISSN: 1539-9087
          EISSN: 1558-3465
          DOI: 10.1145/3567467
          Editor: Tulika Mitra

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 December 2022
          • Online AM: 19 September 2022
          • Accepted: 3 July 2022
          • Revised: 22 April 2022
          • Received: 20 October 2021

          Qualifiers

          • research-article
          • Refereed