Abstract

The rising pressure to simultaneously improve performance and reduce power is driving more diversity into all aspects of computing devices. An algorithm that is well matched to the target hardware can run several times faster and more energy-efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers face the challenge of determining the appropriate algorithm for each combination of target device and data. This paper presents DySel, a novel runtime system that automates this determination for kernel-based data-parallel programming models such as OpenCL, CUDA, OpenACC, and C++AMP. These programming models cover many applications that demand high performance in mobile, cloud, and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. This test deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% overhead in the worst observed case when compared to an oracle. We show four major use cases in which DySel provides significantly more consistent performance without tedious effort from the developer.
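The micro-profiling idea described above can be illustrated with a minimal sketch. This is not DySel's actual implementation or API (the real system profiles OpenCL/CUDA kernels on the target device); it is a hypothetical host-side analogue in which each candidate "kernel" processes one chunk of the real input while being timed, its output is kept as part of the final result, and the fastest candidate then processes the remainder. All function names here are illustrative assumptions.

```python
import time

def kernel_rowwise(data):
    # Candidate 1: straightforward elementwise square.
    return [x * x for x in data]

def kernel_unrolled(data):
    # Candidate 2: same computation, written with manual 2-way unrolling.
    out = [0] * len(data)
    n = len(data) - len(data) % 2
    for i in range(0, n, 2):
        out[i] = data[i] * data[i]
        out[i + 1] = data[i + 1] * data[i + 1]
    for i in range(n, len(data)):
        out[i] = data[i] * data[i]
    return out

def micro_profile_and_run(candidates, data, chunk=1024):
    """Time each candidate on one chunk of the real input, keep its
    (useful) output, then dispatch the remainder to the fastest one."""
    results, timings, pos = [], [], 0
    for kern in candidates:
        part = data[pos:pos + chunk]
        start = time.perf_counter()
        results.append(kern(part))          # profiling work is not wasted:
        timings.append(time.perf_counter() - start)  # its output is kept
        pos += len(part)
    best = candidates[timings.index(min(timings))]
    results.append(best(data[pos:]))        # winner handles the rest
    return [y for part_out in results for y in part_out], best

output, winner = micro_profile_and_run(
    [kernel_rowwise, kernel_unrolled], list(range(10000)))
```

The key property mirrored from the paper is that test deployments contribute to the final result, so the selection cost is bounded by running a few chunks with suboptimal candidates rather than by redundant work.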
DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems