Juggler: a dependence-aware task-based execution framework for GPUs

Published: 10 February 2018

Abstract

Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly limit the speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To efficiently run applications with inter-block data dependences, we need fine-granular task-based execution models that treat the SMs inside a GPU as stand-alone parallel processing units. Such a scheme enables faster execution by utilizing all computation elements inside the GPU and eliminating unnecessary waits at device-wide global barriers.
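To make the cost of global barriers concrete, consider an illustrative 2D wavefront (this example is mine, not from the paper): tile (i, j) depends on tiles (i-1, j) and (i, j-1), so a barrier-based schedule must process one anti-diagonal at a time, inserting a device-wide barrier between steps even when many units sit idle.

```python
# Illustrative sketch of a 2D wavefront dependence pattern: tile (i, j)
# depends on its upper and left neighbors. A barrier-based schedule runs
# one anti-diagonal per step, with a global barrier between steps.

N = 4  # 4x4 grid of tiles

barrier_steps = []
for d in range(2 * N - 1):
    # All tiles on anti-diagonal d are independent of each other,
    # but each step must wait for the slowest tile of the previous one.
    diagonal = [(i, d - i) for i in range(N) if 0 <= d - i < N]
    barrier_steps.append(diagonal)

print(len(barrier_steps))  # 7 global barriers for a 4x4 wavefront
```

Note that the middle diagonals contain up to N ready tiles while the first and last contain only one, so barrier-per-step execution leaves most processing units idle at the edges of the wavefront.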

In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, hence eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime executes entirely on the GPU without relying on the host for the duration of execution. We have evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.
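The core idea of dependence-aware task execution can be sketched on the host as follows. This is an assumed analogue of the scheme, not Juggler's actual in-device implementation: each task carries a counter of unmet dependences, finishing a task decrements its successors' counters, and any free worker (playing the role of an SM) may grab a task the moment it becomes ready, with no global barrier in between.

```python
# Minimal host-side analogue (assumption, not Juggler's real runtime) of
# dependence-counter task execution: no global barriers, tasks become
# ready as soon as their last dependence resolves.
from collections import deque

def run_tasks(num_tasks, deps):
    """deps maps task -> list of tasks it depends on; returns execution order."""
    succs = {t: [] for t in range(num_tasks)}          # reverse edges
    pending = {t: len(deps.get(t, [])) for t in range(num_tasks)}
    for t, ds in deps.items():
        for d in ds:
            succs[d].append(t)

    ready = deque(t for t in range(num_tasks) if pending[t] == 0)
    order = []
    while ready:
        t = ready.popleft()          # a free "SM" grabs a ready task
        order.append(t)
        for s in succs[t]:           # resolve this task's outgoing dependences
            pending[s] -= 1
            if pending[s] == 0:      # last dependence met: successor is ready
                ready.append(s)
    return order

# 2x2 wavefront, task index = i*2 + j: task 3 needs both 1 and 2 done.
order = run_tasks(4, {1: [0], 2: [0], 3: [1, 2]})
print(order)  # [0, 1, 2, 3]
```

In this sketch any topological order is valid; the key property is that readiness is tracked per task, so independent tasks (here 1 and 2) can execute concurrently on different units without a device-wide synchronization point.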


