Abstract
Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) containing CPU and GPU cores are typically required to execute applications concurrently. However, as will be shown in this paper, existing approaches are not well suited to concurrent applications, because they either consider only a single application or do not exploit CPU and GPU cores at the same time. In this paper, we propose an energy-efficient run-time mapping and thread partitioning approach for executing concurrent OpenCL applications on both CPU and GPU cores while satisfying performance requirements. Depending on the performance requirements of each concurrently executing application, the mapping process finds the appropriate number of CPU cores and the operating frequencies of the CPU and GPU cores, and the partitioning process identifies an efficient partitioning of the application’s threads between CPU and GPU cores. We validate the proposed approach experimentally on the Odroid-XU3 hardware platform with various mixes of applications from the Polybench benchmark suite. Additionally, a case study is performed with a real-world application, SLAMBench. Results show an average energy saving of 32% compared to existing approaches while still satisfying the performance requirements.
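To make the selection problem in the abstract concrete, the following is a minimal sketch for a single application, not the paper's algorithm. It assumes a design point is a tuple of (number of CPU cores, CPU frequency, GPU frequency, fraction of work-items sent to the GPU) and picks the lowest-energy point that meets a deadline by exhaustive search. The function names, the frequency lists, and every constant in the toy time and power model are hypothetical placeholders, not measurements from the Odroid-XU3 or values from the paper.

```python
from itertools import product

# All values below are illustrative placeholders, not measured data.
CPU_CORES = [1, 2, 3, 4]
CPU_FREQS_MHZ = [600, 1000, 1400]
GPU_FREQS_MHZ = [177, 420, 600]
GPU_SHARES = [i / 8 for i in range(9)]  # fraction of work-items given to the GPU


def estimate(work_items, cores, f_cpu, f_gpu, gpu_share):
    """Toy analytical model: return (time_s, energy_j) for one design point."""
    cpu_rate = cores * f_cpu * 1e3  # hypothetical work-items per second
    gpu_rate = f_gpu * 6e3          # hypothetical work-items per second
    t_cpu = work_items * (1.0 - gpu_share) / cpu_rate
    t_gpu = work_items * gpu_share / gpu_rate
    # The CPU and GPU partitions run in parallel, so the slower one sets the time.
    time_s = max(t_cpu, t_gpu)
    # Hypothetical active power in watts, scaling cubically with frequency.
    p_cpu = cores * 0.5 * (f_cpu / 1000.0) ** 3
    p_gpu = 1.2 * (f_gpu / 600.0) ** 3
    energy_j = p_cpu * t_cpu + p_gpu * t_gpu
    return time_s, energy_j


def select_design_point(work_items, deadline_s):
    """Return the lowest-energy design point meeting the deadline, or None."""
    best = None
    for cores, f_cpu, f_gpu, share in product(
        CPU_CORES, CPU_FREQS_MHZ, GPU_FREQS_MHZ, GPU_SHARES
    ):
        time_s, energy_j = estimate(work_items, cores, f_cpu, f_gpu, share)
        if time_s > deadline_s:
            continue  # violates the performance requirement
        if best is None or energy_j < best["energy_j"]:
            best = {
                "cores": cores,
                "f_cpu_mhz": f_cpu,
                "f_gpu_mhz": f_gpu,
                "gpu_share": share,
                "time_s": time_s,
                "energy_j": energy_j,
            }
    return best


if __name__ == "__main__":
    print(select_design_point(work_items=1_000_000, deadline_s=0.5))
```

A run-time manager for concurrent applications would also have to divide the available cores among applications and would likely rely on profiled or predicted time and energy rather than a closed-form model; the sketch only shows the shape of the constrained minimisation.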
Index Terms
Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs