Abstract
We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three Nvidia architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never exhibit errors related to weak memory when executed natively can readily exhibit these errors when executed in our testing environment. Our testing environment also provides a means to help identify the root causes of such errors, and automatically suggests how to insert fences that harden an application against weak memory bugs. To understand the cost of GPU fences, we benchmark applications with fences provided by the hardening strategy as well as a more conservative, sound fencing strategy.
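To make "weak memory effects" and the role of fences concrete, the following is a minimal sketch that is not taken from the paper: a toy model of the classic message-passing idiom. The model, and the names `thread_orders`, `interleavings`, `outcomes`, `WRITER` and `READER`, are illustrative assumptions. It assumes a thread's memory operations may take effect in any order unless a fence separates them, which is far simpler than any real GPU memory model.

```python
from itertools import permutations

FENCE = ("fence",)  # hypothetical marker: operations may not move across it


def thread_orders(ops):
    """Yield each order in which one thread's operations may take effect.

    Assumed toy model: operations reorder freely, except across a FENCE.
    """
    segments, current = [], []
    for op in ops:
        if op == FENCE:
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)

    def expand(i):
        if i == len(segments):
            yield []
            return
        for head in permutations(segments[i]):
            for tail in expand(i + 1):
                yield list(head) + tail

    yield from expand(0)


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(writer, reader):
    """Return the set of (r_flag, r_data) values the reader can observe."""
    seen = set()
    for w in thread_orders(writer):
        for r in thread_orders(reader):
            for schedule in interleavings(w, r):
                mem, regs = {"data": 0, "flag": 0}, {}
                for op in schedule:
                    if op[0] == "store":
                        mem[op[1]] = op[2]
                    else:  # ("load", location, register)
                        regs[op[2]] = mem[op[1]]
                seen.add((regs["r_flag"], regs["r_data"]))
    return seen


# Message passing: the writer publishes data, then raises a flag.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r_flag"), ("load", "data", "r_data")]
WRITER_FENCED = [WRITER[0], FENCE, WRITER[1]]
READER_FENCED = [READER[0], FENCE, READER[1]]

# The weak outcome is (1, 0): the flag is seen as set but the data is stale.
weak = (1, 0)
print("unfenced allows weak outcome:", weak in outcomes(WRITER, READER))
print("fenced allows weak outcome:", weak in outcomes(WRITER_FENCED, READER_FENCED))
```

In this toy model a fence is needed in both threads: fencing only the writer still lets the reader's loads reorder. In CUDA source, the analogous fence would be a call such as `__threadfence()`; which fences a given application actually needs is the question the testing environment described above is meant to help answer.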
Exposing errors related to weak memory in GPU applications
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation