research-article

GRace: a low-overhead mechanism for detecting data races in GPU programs

Published: 12 February 2011

Abstract

In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly novices, need tools that can help ensure the correctness of their programs. As in any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) they are ill-suited for handling non-lock synchronization primitives on GPUs; 2) they lack scalability due to the state explosion problem; 3) they report many false positives because of simplified modeling; and/or 4) they incur prohibitive runtime and space overhead. In this paper, we propose GRace, a new mechanism for detecting races in GPU programs that combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of thread scheduling and the execution model of the underlying GPU, GRace can accurately detect data races with no false positives. Based on these ideas, we have built a prototype of GRace with two schemes, GRace-stmt and GRace-addr, for NVIDIA GPUs. Both schemes are integrated with the same static analysis.
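To make the target bug class concrete, here is an illustrative (not from the paper) shared-memory race of the kind such a detector would flag: a kernel that relies on a __syncthreads() barrier for ordering, with the barrier omitted.

```cuda
// Illustrative sketch only: a classic GPU shared-memory data race.
// Each thread writes buf[tid], then reads its neighbor's slot. Without
// a __syncthreads() barrier between the write and the read, a thread
// in a different warp may read buf[tid + 1] before it has been written.
__global__ void shift_left(int *out, const int *in, int n) {
    __shared__ int buf[256];
    int tid = threadIdx.x;
    if (tid < n) buf[tid] = in[tid];
    // BUG: missing __syncthreads() here creates a read-write race
    // between neighboring threads in different warps.
    if (tid + 1 < n) out[tid] = buf[tid + 1];
}
```

Note that the synchronization at stake is a barrier, not a lock, which is why lock-based CPU race detectors (limitation 1 above) fit GPU kernels poorly.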
We have evaluated GRace-stmt and GRace-addr with three data race bugs in three GPU kernel functions and have also compared them with an existing approach, referred to as B-tool. Our experimental results show that both schemes of GRace detect all evaluated cases with no false positives, whereas B-tool reports many false positives for one evaluated case. GRace-addr incurs low runtime overhead (22-116%) and low space overhead (9-18 MB) for the evaluated kernels, while GRace-stmt offers more help in diagnosing data races at the cost of larger overhead.
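The core dynamic check behind an address-based scheme like GRace-addr can be sketched in plain Python (a simplified model, not the paper's GPU implementation): each thread logs the shared-memory addresses it touches within a barrier interval, and at the barrier the logs are cross-checked for conflicting accesses, i.e., two different threads touching the same address with at least one write.

```python
# Simplified host-side model (assumption: not GRace's actual code) of
# barrier-interval race checking: log (address, is_write) pairs per
# thread, then report cross-thread conflicts where at least one access
# is a write.

def check_barrier_interval(access_logs):
    """access_logs: {thread_id: [(addr, is_write), ...]} for one
    barrier interval. Returns a list of (addr, tid1, tid2) conflicts."""
    races = []
    seen = {}  # addr -> list of (thread_id, is_write) already logged
    for tid, log in access_logs.items():
        for addr, is_write in log:
            for other_tid, other_write in seen.get(addr, []):
                # A race needs two distinct threads and at least one write.
                if other_tid != tid and (is_write or other_write):
                    races.append((addr, other_tid, tid))
            seen.setdefault(addr, []).append((tid, is_write))
    return races

# Two threads write the same shared address with no intervening barrier.
logs = {0: [(0x10, True)], 1: [(0x10, True), (0x14, False)]}
print(check_barrier_interval(logs))  # → [(16, 0, 1)]
```

Checking per address rather than per program statement mirrors the paper's trade-off: the address-level view is cheaper, while a statement-level scheme (GRace-stmt) retains more context for diagnosis.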


Published in

ACM SIGPLAN Notices, Volume 46, Issue 8 (PPoPP '11), August 2011, 300 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2038037

PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, February 2011, 326 pages
ISBN: 9781450301190
DOI: 10.1145/1941553
General Chair: Calin Cascaval
Program Chair: Pen-Chung Yew

              Copyright © 2011 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

