Juggler: A Dependence-Aware Task-Based Execution Framework for GPUs
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Abstract
Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly limit this speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To run applications with inter-block data dependences efficiently, we need fine-grained task-based execution models that treat the SMs inside a GPU as stand-alone parallel processing units. Such a scheme enables faster execution by utilizing all computation elements inside the GPU and eliminating unnecessary waits at device-wide global barriers.
In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host throughout the entire execution. We evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.
Supplemental Material
Available for Download
This file is a snapshot of the Juggler repository located at the following address. Please refer to the repository for the most up-to-date version of the project.