Abstract
Heterogeneous computing on CPUs and GPUs has traditionally assigned fixed roles to each device: the GPU handles data-parallel work by exploiting its massive number of cores, while the CPU handles non-data-parallel work such as sequential code and data-transfer management. This fixed division can be a poor solution, as it underutilizes the CPU, generalizes poorly beyond a single CPU-GPU pair, and may waste a large fraction of execution time on data transfers. Moreover, CPUs are performance-competitive with GPUs on many workloads, so partitioning work by fixed roles alone may be a poor choice. In this article, we present the Single-Kernel Multiple Devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer writes a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs. The goal is to improve performance by maximally utilizing all available resources to execute the kernel. SKMD tackles the difficult challenges of exposed data-transfer costs and the performance variation GPUs exhibit with respect to input size. On real hardware, SKMD achieves an average speedup of 28% over a fastest-device execution strategy on a system with one multicore CPU and two asymmetric GPUs, for a set of popular OpenCL kernels.
SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration