
Specifying and testing GPU workgroup progress models

Published: 15 October 2021

Abstract

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model.

This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a set of 483 progress litmus tests. Using the termination oracle, we can then determine the expected status of each litmus test -- i.e. whether it is guaranteed to eventually terminate -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.
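To illustrate the kind of fine-grained synchronization idiom at the heart of these litmus tests, here is a minimal sketch (not from the paper's artifact; all names are illustrative) modelling two GPU workgroups as CPU threads in Python. One "workgroup" spins until the other sets a flag: on a CPU with a fair preemptive scheduler this always terminates, but on a GPU whose scheduler never guarantees that the flag-setting workgroup runs, termination depends entirely on the progress model -- which is exactly what a termination oracle must decide.

```python
import threading

# Two-"workgroup" progress litmus test: wg0 busy-waits on a flag
# that wg1 sets. Whether this program is guaranteed to terminate
# on a GPU depends on whether the scheduler must eventually run
# wg1 while wg0 spins (e.g. an occupancy-bound progress model).
flag = False
result = []

def workgroup_0():
    # Spin until the other workgroup publishes the flag.
    while not flag:
        pass
    result.append("wg0 done")

def workgroup_1():
    global flag
    flag = True
    result.append("wg1 done")

t0 = threading.Thread(target=workgroup_0)
t1 = threading.Thread(target=workgroup_1)
t0.start(); t1.start()
t0.join(); t1.join()
```

Under a fair scheduler both threads finish, so `result` ends up containing both completion messages; under a hypothetical GPU scheduler that never preempts wg0 in favor of wg1, wg0 could spin forever, and a termination oracle would classify the test as not guaranteed to terminate.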


Supplemental Material

Auxiliary Presentation Video


