research-article
DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model

Published: 25 March 2016

Abstract

The rising pressure to simultaneously improve performance and reduce power is driving more diversity into all aspects of computing devices. An algorithm that is well matched to the target hardware can run multiple times faster and more energy-efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers face the challenge of determining the appropriate algorithm for each potential combination of target device and data. This paper presents DySel, a novel runtime system that automates this determination for kernel-based data-parallel programming models such as OpenCL, CUDA, OpenACC, and C++ AMP. These programming models cover many applications that demand high performance in mobile, cloud, and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. This test deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% overhead in the worst observed case when compared to an oracle. We show four major use cases in which DySel provides significantly more consistent performance without tedious effort from the developer.
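The micro-profiling idea from the abstract can be sketched in plain Python. This is an illustrative sketch only, not the DySel API: real DySel dispatches OpenCL/CUDA kernels on devices, whereas here the "kernels" are interchangeable Python functions and the function names (`dysel_run`, `probe_size`) are assumptions. The key property it demonstrates is that each candidate is timed on a small, disjoint slice of the actual input, and those slices contribute to the final result, so profiling work is not wasted.

```python
import time

def dysel_run(candidates, data, probe_size):
    """Process `data` with the fastest of `candidates`, chosen by micro-profiling.

    Each candidate kernel is timed on its own small, disjoint slice of the
    real input; the slice outputs are kept as part of the final result.
    The remaining data is then processed by the fastest candidate.
    """
    results, timings, offset = [], [], 0
    for kernel in candidates:
        chunk = data[offset:offset + probe_size]
        start = time.perf_counter()
        results.extend(kernel(chunk))          # profiling output is real output
        timings.append(time.perf_counter() - start)
        offset += probe_size
    best = candidates[timings.index(min(timings))]
    results.extend(best(data[offset:]))        # bulk of the data uses the winner
    return best, results

# Two interchangeable "kernels" computing the same function differently.
def scalar_double(xs):
    return [2 * x for x in xs]

def unrolled_double(xs):
    out = []
    for i in range(0, len(xs) - 1, 2):
        out += [2 * xs[i], 2 * xs[i + 1]]
    if len(xs) % 2:
        out.append(2 * xs[-1])
    return out

data = list(range(100))
best, out = dysel_run([scalar_double, unrolled_double], data, probe_size=8)
assert out == [2 * x for x in data]  # result is correct regardless of winner
```

Because every candidate must compute the same function, correctness of the combined result does not depend on which kernel wins the timing race; only performance does.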



Published in
ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16), April 2016, 774 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2954679. Editor: Andy Gill.

Also published in ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, 824 pages. ISBN: 9781450340915. DOI: 10.1145/2872362. General Chair: Tom Conte. Program Chair: Yuanyuan Zhou.

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

