Abstract
As GPU availability has increased and programming support has matured, a wider variety of applications is being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g., Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model.
This work presents a collection of tools and experimental data to aid specification designers in considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a suite of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test (i.e., whether it is guaranteed to eventually terminate) under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, contrary to what prior work hypothesized.
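To make the idea of a progress litmus test and a termination oracle concrete, the following is a minimal sketch and not the formalization used in this work. It assumes a hypothetical toy language with a single shared flag and two instructions (`store` and `spin_until`), and only two illustrative progress models: a fair one, in which every unfinished thread is eventually scheduled, and an unfair one, in which the scheduler may starve any thread. All names here (`step`, `reachable`, `always_terminates`) are invented for illustration.

```python
# Toy instruction set over one shared flag (hypothetical, for illustration only):
#   ("store", v)       write v to the flag, then advance
#   ("spin_until", v)  advance only if the flag equals v, otherwise retry

def step(program, state, tid):
    """Successor state when thread `tid` takes one step, or None if it has finished."""
    pcs, flag = state
    pc = pcs[tid]
    if pc == len(program[tid]):
        return None
    op, val = program[tid][pc]
    if op == "store":
        flag, pc = val, pc + 1
    elif op == "spin_until" and flag == val:
        pc += 1
    # A failed spin leaves the state unchanged (a self-loop).
    return (pcs[:tid] + (pc,) + pcs[tid + 1:], flag)

def reachable(program, start):
    """All states reachable from `start` under any interleaving."""
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for tid in range(len(program)):
            t = step(program, s, tid)
            if t is not None and t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def always_terminates(program, fair):
    """Toy oracle: is termination guaranteed under the chosen progress model?

    Program counters never decrease in this toy language, so the only
    cycles in the state graph are self-loops caused by failed spins.
    """
    n = len(program)
    start = ((0,) * n, 0)
    for s in reachable(program, start):
        live = [t for t in range(n) if step(program, s, t) is not None]
        spinning = [t for t in live if step(program, s, t) == s]
        if fair:
            # Every live thread keeps being scheduled, so execution can
            # run forever only if all live threads are stuck spinning.
            stuck = bool(live) and len(spinning) == len(live)
        else:
            # The scheduler may run a single spinning thread forever.
            stuck = bool(spinning)
        if stuck:
            return False
    return True

# Litmus test: thread 0 waits for a flag that thread 1 sets.
litmus = [[("spin_until", 1)], [("store", 1)]]
print(always_terminates(litmus, fair=True))   # True
print(always_terminates(litmus, fair=False))  # False
```

In this sketch the litmus test terminates under the fair model but not under the unfair one, because an unfair scheduler may run the spinning thread forever and never schedule the thread that sets the flag. This is the kind of distinction between progress models that a litmus test is meant to expose.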