Abstract
Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. The lack of reproducibility is aggravated on massively parallel architectures such as graphics processing units (GPUs), which run thousands of concurrent threads. We believe that a deterministic environment that eases debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.
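As a small illustration of how the same input can yield different outputs, the sketch below accumulates the same floating-point values in every possible arrival order. It is not taken from the paper: the function name `sum_in_order` and the sample values are assumptions chosen for illustration, and the loop merely stands in for threads whose updates reach a shared accumulator in a scheduler-dependent order.

```python
import itertools

def sum_in_order(values, order):
    """Accumulate values in the given order, as if threads arrived that way.

    Hypothetical helper: 'order' models one possible thread interleaving.
    """
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Same input every time; only the accumulation order changes.
values = [0.1, 0.2, 0.3]
results = {
    sum_in_order(values, order)
    for order in itertools.permutations(range(len(values)))
}

# Floating-point addition is not associative, so more than one
# distinct sum can appear for the same set of inputs.
print(sorted(results))
```

Because floating-point addition is not associative, the set of results contains more than one value, which is the kind of run-to-run variation that makes testing against a single expected output unreliable.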
Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads, and scaling them to thousands of threads on a GPU is a major challenge. This paper proposes GPUDet, a scalable hardware mechanism that provides determinism in GPU architectures. We characterize the deterministic and nondeterministic aspects of current GPU execution models and use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
Our simulation results indicate that GPUDet incurs only a 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
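One way to picture how out-of-order writes can still produce a deterministic result is a depth-test-style rule: each write carries a priority, and the memory location keeps whichever write has the best priority regardless of arrival order. The sketch below is a software analogy only. It assumes the thread ID serves as the priority and that the smallest ID wins; the abstract does not state the rule GPUDet uses, and `resolve_writes` is a hypothetical name.

```python
import random

def resolve_writes(writes):
    """Resolve conflicting writes independently of their arrival order.

    writes: iterable of (thread_id, address, value) tuples.
    Assumption for illustration: per address, the write from the
    smallest thread_id wins, similar in spirit to a depth test.
    """
    winner = {}
    for thread_id, address, value in writes:
        current = winner.get(address)
        if current is None or thread_id < current[0]:
            winner[address] = (thread_id, value)
    return {address: value for address, (_, value) in winner.items()}

# Two threads write to each address; the values are made up.
writes = [(7, 0x10, "a"), (3, 0x10, "b"), (5, 0x20, "c"), (1, 0x20, "d")]

# Shuffle to mimic a different arrival order on another run.
shuffled = writes[:]
random.shuffle(shuffled)

# The final memory image is the same for every arrival order.
assert resolve_writes(shuffled) == resolve_writes(writes)
```

The design point the analogy captures is that the comparison is commutative and associative over writes to one address, so no ordering between the writers has to be enforced while they execute.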
GPUDet: a deterministic GPU architecture
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems