Abstract
In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly novices, need tools that can help ensure the correctness of their programs. As in any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) being ill-suited for handling non-lock synchronization primitives on GPUs; 2) lacking scalability due to the state explosion problem; 3) reporting many false positives because of simplified modeling; and/or 4) incurring prohibitive runtime and space overhead. In this paper, we propose GRace, a new mechanism for detecting races in GPU programs that combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU memory hierarchy to log runtime data accesses efficiently. To improve performance, GRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of thread scheduling and the execution model of the underlying GPUs, GRace can accurately detect data races with no false positives reported. Based on these ideas, we have built a prototype of GRace with two schemes, GRace-stmt and GRace-addr, for NVIDIA GPUs. Both schemes are integrated with the same static analysis.
We have evaluated GRace-stmt and GRace-addr with three data race bugs in three GPU kernel functions and have also compared them with an existing approach, referred to as B-tool. Our experimental results show that both schemes of GRace are effective in detecting all evaluated cases with no false positives, whereas B-tool reports many false positives for one evaluated case. On the one hand, GRace-addr incurs low runtime overhead, i.e., 22-116%, and low space overhead, i.e., 9-18MB, for the evaluated kernels. On the other hand, GRace-stmt offers more help in diagnosing data races, at the cost of larger overhead.
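As a rough illustration of what an address-based dynamic check involves, the following Python sketch replays a log of shared-memory accesses and flags addresses touched by two different threads within the same barrier interval when at least one access is a write. This is not GRace's implementation: the `Access` record, `find_races`, and `neighbor_kernel_log` are hypothetical names introduced here, the log is simulated on the host rather than collected on a GPU, and the sketch ignores warp-level scheduling, which the abstract says GRace exploits.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    """One logged shared-memory access (hypothetical record layout)."""
    thread: int     # thread id within the block
    addr: int       # shared-memory index that was touched
    is_write: bool  # True for a store, False for a load
    epoch: int      # number of barriers this thread passed before the access


def find_races(log):
    """Return sorted (epoch, addr) pairs accessed by two different threads
    in the same barrier interval, with at least one of the accesses a write."""
    readers = defaultdict(set)
    writers = defaultdict(set)
    for a in log:
        key = (a.epoch, a.addr)
        (writers if a.is_write else readers)[key].add(a.thread)

    races = []
    for key, w in writers.items():
        write_write = len(w) > 1
        # A read conflicts only if it comes from a thread other than the writer.
        read_write = bool(readers.get(key, set()) - w)
        if write_write or read_write:
            races.append(key)
    return sorted(races)


def neighbor_kernel_log(n_threads, with_barrier):
    """Simulated log: each thread writes s[tid], then reads its neighbor's slot.
    Without a barrier between the two steps, the read races with the write."""
    log = []
    for t in range(n_threads):
        log.append(Access(t, t, True, 0))
        read_epoch = 1 if with_barrier else 0
        log.append(Access(t, (t + 1) % n_threads, False, read_epoch))
    return log


if __name__ == "__main__":
    print(find_races(neighbor_kernel_log(4, with_barrier=False)))
    print(find_races(neighbor_kernel_log(4, with_barrier=True)))
```

In this toy model the barrier is what separates the write and read phases into different epochs, so the same access pattern is reported as racy without it and clean with it.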
GRace: a low-overhead mechanism for detecting data races in GPU programs
PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming