SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration

Published: 31 August 2015

Abstract

Heterogeneous computing on CPUs and GPUs has traditionally assigned fixed roles to each device: the GPU handles data-parallel work by exploiting its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code and data transfer management. This division of work can be a poor solution, as it underutilizes the CPU, generalizes poorly beyond a single CPU-GPU combination, and may waste a large fraction of execution time on data transfers. Further, CPUs are performance competitive with GPUs on many workloads, so partitioning work by fixed roles alone may be a poor choice. In this article, we present the Single Kernel on Multiple Devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer writes a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs. The goal is to improve performance by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variation GPUs exhibit with respect to input size. On real hardware, SKMD achieves an average speedup of 28% over a fastest-device execution strategy on a system with one multicore CPU and two asymmetric GPUs, across a set of popular OpenCL kernels.
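The workload partitioning the abstract describes can be sketched as a proportional split of a kernel's work-groups across devices. This is an illustrative sketch only, not SKMD's actual algorithm: `partition_workgroups` and its per-device throughput inputs are hypothetical, and the real system additionally accounts for data transfer costs and the GPUs' performance variation with input size.

```python
def partition_workgroups(total_groups, throughputs):
    """Split the index range [0, total_groups) into contiguous
    sub-ranges, one per device, proportional to each device's
    estimated throughput (work-groups per second).

    Hypothetical helper for illustration; SKMD's partitioner also
    models transfer cost and input-size-dependent GPU performance.
    """
    total_tp = sum(throughputs)
    ranges, start = [], 0
    for i, tp in enumerate(throughputs):
        if i == len(throughputs) - 1:
            end = total_groups  # last device absorbs rounding remainder
        else:
            end = start + round(total_groups * tp / total_tp)
        ranges.append((start, end))
        start = end
    return ranges

# A GPU estimated 3x faster than the CPU gets 75 of 100 work-groups.
print(partition_workgroups(100, [3, 1]))  # [(0, 75), (75, 100)]
```

Each device would then execute the unmodified kernel over only its assigned contiguous range, for example by passing the range start as the global work offset when enqueuing the kernel on that device's command queue.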



Published in
ACM Transactions on Computer Systems, Volume 33, Issue 3 (September 2015), 140 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/2818727
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Published: 31 August 2015
• Accepted: 1 June 2015
• Revised: 1 February 2015
• Received: 1 July 2014

          Qualifiers

          • research-article
          • Research
          • Refereed
