Abstract
Heterogeneous multi-core architectures consisting of CPUs and GPUs are commonplace in today's embedded systems. These architectures offer the potential for energy-efficient computing if the application task is mapped to the right core. Realizing that potential is challenging due to the complex and evolving nature of hardware and applications. This paper presents an automatic approach to map OpenCL kernels onto heterogeneous multi-cores for a given optimization criterion, whether that is faster runtime, lower energy consumption, or a trade-off between the two. We achieve this by developing a machine-learning-based approach to predict which processor should run the OpenCL kernel and the host program, and at what frequency that processor should operate. Instead of hand-tuning a model for each optimization metric, we use machine learning to build a unified framework that first automatically learns the optimization heuristic for each metric off-line, then uses the learned knowledge to schedule OpenCL kernels at runtime based on code and runtime information of the program. We apply our approach to a set of representative OpenCL benchmarks and evaluate it on an ARM big.LITTLE mobile platform. Our approach achieves over 93% of the performance delivered by a perfect predictor. Compared to a comparative heterogeneous-aware OpenCL task-mapping scheme, we obtain average improvements of 1.2x, 1.6x, and 1.8x for runtime, energy consumption, and energy-delay product, respectively.
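To make the scheduling idea concrete, the sketch below shows what a learned mapping heuristic of this kind looks like conceptually: a classifier from cheap code and runtime features of an OpenCL kernel to a (device, frequency) configuration, with one learned model per optimization metric. This is not the paper's actual model; the feature names, thresholds, devices, and frequencies are all hypothetical placeholders standing in for what an off-line trainer would produce.

```python
# Illustrative sketch only: the features, thresholds, and frequency
# values below are hypothetical, standing in for a model learned
# off-line from profiled training runs.
from dataclasses import dataclass

@dataclass
class KernelFeatures:
    comp_to_mem_ratio: float   # compute instructions per memory access
    transfer_bytes: int        # host <-> device data-transfer size
    work_items: int            # global work size

def predict_config(f: KernelFeatures, metric: str):
    """Return a (device, MHz) pair for the requested optimization metric.

    Mimics the small per-metric decision trees an off-line trainer
    might produce ("runtime", "energy", or "edp" for energy-delay
    product); at runtime the scheduler queries the tree matching the
    user's chosen metric.
    """
    if metric == "runtime":
        # Highly parallel, compute-bound kernels favor the GPU.
        if f.comp_to_mem_ratio > 4.0 and f.work_items > 1 << 16:
            return ("gpu", 600)
        return ("big-cpu", 1800)
    if metric == "energy":
        # When data transfer dominates, offloading wastes energy.
        if f.transfer_bytes > 32 << 20:
            return ("little-cpu", 1000)
        return ("gpu", 400)
    # Energy-delay product: balance speed against power.
    if f.comp_to_mem_ratio > 2.0:
        return ("gpu", 500)
    return ("big-cpu", 1400)

cfg = predict_config(KernelFeatures(6.0, 1 << 20, 1 << 20), "runtime")
print(cfg)  # -> ('gpu', 600)
```

In the paper's framework the analogous predictor is trained per metric off-line, so supporting a new optimization goal means collecting training data and retraining rather than hand-crafting a new heuristic.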
Adaptive optimization for OpenCL programs on embedded heterogeneous systems. In LCTES 2017: Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems.