Abstract
High-performance embedded platforms are increasingly adopting heterogeneous systems-on-chip (HeSoCs) that couple multi-core CPUs with accelerators such as GPUs, FPGAs, or AI engines. Adopting HeSoCs for real-time workloads is not straightforward, though, because contention on shared resources in the memory hierarchy, and in particular on main memory (DRAM), causes unpredictable latency increases. To tackle this problem, both the research community and certification authorities mandate either (i) that accesses from parallel threads to shared system resources (typically main memory) happen in a mutually exclusive manner by design, or (ii) that per-thread bandwidth regulation is enforced. Such arbitration schemes provide timing guarantees but make poor use of the memory bandwidth available in a modern HeSoC. Controlled Memory Request Injection (CMRI) is a recently proposed bandwidth limitation concept that builds on a mutually exclusive schedule but still allows threads that are not currently entitled to access memory to use as much of the unused bandwidth as possible without losing the timing guarantee. CMRI has so far been discussed in the context of multi-core CPUs, but the same principle also applies to a more complex system such as an HeSoC. In this article, we introduce two CMRI schemes suitable for HeSoCs: Voluntary Throttling via code refactoring and Bandwidth Regulation via dynamic throttling. We extensively characterize a proof-of-concept implementation of both schemes on two HeSoCs, an NVIDIA Tegra TX2 and a Xilinx UltraScale+, highlighting the benefits and costs of CMRI for synthetic workloads that model worst-case DRAM access. We also test the effectiveness of CMRI with real benchmarks, studying the effect of interference between the host CPU and the accelerators.
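To make the idea of per-thread bandwidth regulation concrete, the following is a minimal sketch of a budget-based regulator, assuming a simple model in which a thread not entitled to access memory may inject at most a fixed number of requests per regulation period. The names `Regulator`, `try_issue`, `new_period`, and `simulate` are hypothetical and chosen for illustration only; this is not the implementation evaluated in the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Regulator:
    """Per-thread request budget (hypothetical model, not the article's code)."""
    budget: int      # requests allowed per regulation period (assumed parameter)
    used: int = 0    # requests already issued in the current period

    def try_issue(self) -> bool:
        """Return True if a request may be injected now, False if throttled."""
        if self.used < self.budget:
            self.used += 1
            return True
        return False

    def new_period(self) -> None:
        """Replenish the budget at the start of each regulation period."""
        self.used = 0


def simulate(periods: int, demand_per_period: int, budget: int) -> List[int]:
    """Count how many requests a non-entitled thread injects in each period."""
    reg = Regulator(budget=budget)
    injected = []
    for _ in range(periods):
        reg.new_period()
        # Each attempted request is either granted or throttled.
        granted = sum(reg.try_issue() for _ in range(demand_per_period))
        injected.append(granted)
    return injected


if __name__ == "__main__":
    # A thread that wants 10 requests per period but is capped at 4.
    print(simulate(periods=3, demand_per_period=10, budget=4))  # [4, 4, 4]
```

In this toy model the budget bounds the interference a non-entitled thread can cause in any period, while still letting it make progress instead of stalling completely.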
Index Terms
Evaluating Controlled Memory Request Injection for Efficient Bandwidth Utilization and Predictable Execution in Heterogeneous SoCs