Abstract
As GPUs have become an integral part of nearly every processor, GPU programming has become increasingly popular. GPU programming requires a combination of extreme levels of parallelism and low-level programming, making it easy for concurrency bugs such as data races to arise. These concurrency bugs can be extremely subtle and difficult to debug due to the massive numbers of threads running concurrently on a modern GPU.

While some tools exist to detect data races in GPU programs, they are often prohibitively slow or focused only on a small class of data races in shared memory. Compared to prior work, our race detector, CURD, can detect data races precisely on both shared and global memory, selects an appropriate race detection algorithm based on the synchronization used in a program, and utilizes efficient compiler instrumentation to reduce performance overheads. Across 53 benchmarks, we find that using CURD incurs an average slowdown of just 2.88x over native execution. CURD is 2.1x faster than Nvidia’s CUDA-Racecheck race detector, despite detecting a much broader class of races. CURD finds 35 races across our benchmarks, including bugs in established benchmark suites and in sample programs from Nvidia.
CURD: a dynamic CUDA race detector
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation