Modeling and analyzing evaluation cost of CUDA kernels

Published: 04 January 2021

Abstract

General-purpose programming on GPUs (GPGPU) is increasingly popular as applications such as machine learning and scientific computing demand high throughput for vector-parallel workloads. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers.

This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.
