
Adaptive optimization for OpenCL programs on embedded heterogeneous systems

Published: 21 June 2017

Abstract

Heterogeneous multi-core architectures consisting of CPUs and GPUs are commonplace in today’s embedded systems. These architectures offer the potential for energy-efficient computing if the application task is mapped to the right core. Realizing this potential is challenging due to the complex and evolving nature of hardware and applications. This paper presents an automatic approach to map OpenCL kernels onto heterogeneous multi-cores for a given optimization criterion – whether it is faster runtime, lower energy consumption, or a trade-off between the two. This is achieved by developing a machine-learning-based approach to predict which processor should run the OpenCL kernel and the host program, and at what frequency the processor should operate. Instead of hand-tuning a model for each optimization metric, we use machine learning to develop a unified framework that first automatically learns the optimization heuristic for each metric off-line, then uses the learned knowledge to schedule OpenCL kernels at runtime based on code and runtime information of the program. We apply our approach to a set of representative OpenCL benchmarks and evaluate it on an ARM big.LITTLE mobile platform. Our approach achieves over 93% of the performance delivered by a perfect predictor. We obtain, on average, 1.2x, 1.6x, and 1.8x improvements for runtime, energy consumption, and the energy delay product, respectively, compared to a comparative heterogeneous-aware OpenCL task-mapping scheme.
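The core idea described above – an off-line-trained model that maps code and runtime features of an OpenCL kernel to a (processor, frequency) configuration – can be illustrated with a minimal sketch. The feature names, training points, and the 1-nearest-neighbour model below are invented stand-ins for the paper’s learned heuristic, chosen only to make the two-stage structure (off-line training, runtime prediction) concrete:

```python
# Hypothetical sketch: an off-line-trained predictor maps kernel features
# (here: compute-to-memory ratio, number of work-items, branch divergence)
# to a (device, frequency) configuration. A tiny 1-nearest-neighbour model
# stands in for the learned heuristic; all values are illustrative.
from math import dist

# Off-line phase: each sample pairs measured kernel features with the
# best (device, frequency_MHz) found for the chosen optimization metric.
TRAINING = [
    ((8.0, 1_000_000, 0.1), ("GPU", 600)),          # compute-bound, huge grid
    ((6.5,   500_000, 0.2), ("GPU", 600)),
    ((0.5,     4_096, 0.7), ("big-CPU", 1800)),     # divergent, small grid
    ((0.8,     8_192, 0.6), ("big-CPU", 1400)),
    ((0.3,     1_024, 0.3), ("LITTLE-CPU", 1000)),  # light, latency-tolerant
]

def normalise(f):
    # Scale features to comparable ranges before computing distances.
    return (f[0] / 10.0, f[1] / 1_000_000.0, f[2])

def predict(features):
    """Runtime phase: return the (device, frequency) of the nearest sample."""
    nearest = min(TRAINING,
                  key=lambda s: dist(normalise(s[0]), normalise(features)))
    return nearest[1]

device, freq = predict((7.2, 800_000, 0.15))
print(device, freq)  # a compute-bound kernel with a large grid maps to the GPU
```

In the paper’s actual framework the model is trained per optimization metric (runtime, energy, or energy delay product) and consulted before kernel launch, so the same kernel can be steered to different cores and frequencies depending on the goal.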



Published in

ACM SIGPLAN Notices, Volume 52, Issue 5 (LCTES '17), May 2017, 120 pages
ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/3140582

LCTES 2017: Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems
June 2017, 120 pages
ISBN: 9781450350303, DOI: 10.1145/3078633
General Chair: Vijay Nagarajan; Program Chair: Zili Shao

              Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

