research-article

GPUDet: a deterministic GPU architecture

Published: 16 March 2013

Abstract

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. The lack of reproducibility is aggravated on massively parallel architectures such as graphics processing units (GPUs), which run thousands of concurrent threads. We believe that providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.

Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
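The abstract does not spell out how the Z-Buffer Unit makes out-of-order writes deterministic. The following is only a minimal software sketch of the general idea, assuming that every write carries a deterministic priority (here a hypothetical `thread_id`) and that a depth-test-style comparison keeps the write with the smallest priority at each address. The function name `commit_writes` and the "lowest ID wins" rule are illustrative assumptions, not the paper's stated design.

```python
import random


def commit_writes(writes):
    """Resolve conflicting writes with a depth-test-style rule (assumed).

    writes: iterable of (thread_id, address, value) in arbitrary arrival order.
    For each address, the write with the smallest thread_id wins, so the
    final memory image does not depend on the order in which writes arrive.
    Assumes each thread writes a given address at most once.
    """
    memory = {}  # address -> committed value
    owner = {}   # address -> thread_id of the winning write (acts like a depth value)
    for thread_id, address, value in writes:
        if address not in owner or thread_id < owner[address]:
            owner[address] = thread_id
            memory[address] = value
    return memory


# Sixteen hypothetical threads contend for four addresses.
writes = [(tid, tid % 4, f"v{tid}") for tid in range(16)]
reference = commit_writes(writes)

# Shuffling the arrival order never changes the committed result.
for seed in range(100):
    shuffled = writes[:]
    random.Random(seed).shuffle(shuffled)
    assert commit_writes(shuffled) == reference
```

Because the winner at each address is chosen by comparing priorities rather than by arrival time, the writes can be processed in parallel and in any order while still producing the same output.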

Our simulation results indicate that GPUDet incurs only a 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.



Published in

ACM SIGPLAN Notices, Volume 48, Issue 4
ASPLOS '13
April 2013
540 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2499368

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
March 2013
574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
