Abstract
Despite the growing popularity of GPGPU programming, there is not yet a portable and formally specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.
We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
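The occupancy discovery idea summarised in the abstract can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the paper's OpenCL implementation: Python threads stand in for workgroups, a semaphore with `occupancy` slots stands in for the GPU's limit on concurrently resident workgroups, and the names `Discovery`, `enter_poll`, and `run_kernel` are hypothetical. A workgroup that finds the poll open is counted as a participant; the poll is closed before any participant proceeds, so every counted workgroup is known to be running when the barrier is used.

```python
import threading


class Discovery:
    """Shared poll state (stands in for variables in GPU global memory)."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.poll_open = True
        self.count = 0        # number of participating workgroups so far
        self.barrier = None   # created once the participant count is final

    def enter_poll(self):
        """Return a participant id, or None if the poll was already closed."""
        # Polling phase: join only while the poll is still open.
        with self.mutex:
            if not self.poll_open:
                return None
            my_id = self.count
            self.count += 1
        # Closing phase: the first participant to get here closes the poll,
        # which freezes the count, and sizes the barrier to match it.
        with self.mutex:
            if self.poll_open:
                self.poll_open = False
                self.barrier = threading.Barrier(self.count)
        return my_id


def run_kernel(num_groups, occupancy):
    """Launch num_groups simulated workgroups, at most `occupancy` at a time."""
    slots = threading.Semaphore(occupancy)  # assumption: models occupancy
    disc = Discovery()
    data = [None] * num_groups
    seen = {}

    def workgroup():
        with slots:  # a workgroup holds its slot until it finishes
            my_id = disc.enter_poll()
            if my_id is None:
                return  # non-participant: exit without touching the barrier
            data[my_id] = my_id * 10
            disc.barrier.wait()  # every participant has written its value
            right = (my_id + 1) % disc.count
            seen[my_id] = data[right]

    threads = [threading.Thread(target=workgroup) for _ in range(num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return disc.count, seen
```

Calling `run_kernel(num_groups=16, occupancy=4)` returns the discovered participant count and the values each participant read from its neighbour after the barrier. In this model the count never exceeds `occupancy`, because participants keep their slots until the poll has closed; the exact count can vary from run to run with thread scheduling.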
Portable inter-workgroup barrier synchronisation for GPUs
OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications