Exposing errors related to weak memory in GPU applications

Published: 02 June 2016

Abstract

We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three Nvidia architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never exhibit errors related to weak memory when executed natively can readily exhibit these errors when executed in our testing environment. Our testing environment also provides a means to help identify the root causes of such errors, and automatically suggests how to insert fences that harden an application against weak memory bugs. To understand the cost of GPU fences, we benchmark applications with fences provided by the hardening strategy as well as a more conservative, sound fencing strategy.

Published in

ACM SIGPLAN Notices, Volume 51, Issue 6 (PLDI '16)
June 2016, 726 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2980983
Editor: Andy Gill

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2016, 726 pages
ISBN: 9781450342612
DOI: 10.1145/2908080
General Chair: Chandra Krintz
Program Chair: Emery Berger

Copyright © 2016 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
