Abstract
Heterogeneous system architecture (HSA) and OpenCL define scoped synchronization to facilitate low-overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a priori. It works poorly for dynamic sharing patterns (e.g., work stealing), where programmers cannot use a faster, smaller scope because of the rare possibility that the work is stolen by a thread in a distant, slower scope. This puts programmers in a conundrum: optimize the common case by synchronizing at a faster, smaller scope, or use work stealing at a slower, larger scope. In this paper, we propose to extend scoped synchronization with remote-scope promotion. This allows the most frequent sharers to synchronize through a small scope. Infrequent sharers synchronize by promoting that remote small scope to a larger shared scope. Synchronization using remote-scope promotion provides performance robustness for dynamic workloads, where the benefits provided by scoped synchronization and work stealing are hard to anticipate. Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average, and dynamic work stealing alone achieves a 1.18x speedup on average. In contrast, synchronization using remote-scope promotion achieves a robust 1.25x speedup on average across a diverse set of graph benchmarks and inputs.
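The trade-off described above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the mechanism or the evaluation from the paper: the names `ScopedQueue` and `run` and the three cost constants are hypothetical, and the cost values are arbitrary illustrative numbers rather than measurements. The model charges the queue owner and the occasional thief a fixed cost per synchronization and compares synchronizing everything at a large scope against letting the owner use a small scope while thieves pay for a promotion.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical relative costs of one synchronization (illustrative only,
# NOT figures from the paper).
LOCAL_COST = 1     # owner synchronizes at a small scope
GLOBAL_COST = 10   # everyone synchronizes at a large scope
PROMOTE_COST = 25  # distant thief promotes the owner's small scope


@dataclass
class ScopedQueue:
    """Toy work queue owned by one small scope; tracks total sync cost."""
    tasks: deque = field(default_factory=deque)
    sync_cost: int = 0

    def push(self, task, cost):
        self.sync_cost += cost
        self.tasks.append(task)

    def pop_local(self, cost):
        # Owner takes from the tail of its own queue.
        self.sync_cost += cost
        return self.tasks.pop() if self.tasks else None

    def steal(self, cost):
        # Thief takes from the head of a remote queue.
        self.sync_cost += cost
        return self.tasks.popleft() if self.tasks else None


def run(n_tasks, n_steals, owner_cost, thief_cost):
    """Push n_tasks, let thieves steal n_steals, drain the rest locally."""
    q = ScopedQueue()
    for t in range(n_tasks):
        q.push(t, owner_cost)
    for _ in range(n_steals):
        q.steal(thief_cost)
    while q.pop_local(owner_cost) is not None:
        pass
    return q.sync_cost


# Rare stealing: the small-scope owner path dominates.
all_global = run(100, 2, GLOBAL_COST, GLOBAL_COST)   # 2010
promoted = run(100, 2, LOCAL_COST, PROMOTE_COST)     # 249
# Frequent stealing: promotions dominate in this toy model.
promoted_heavy = run(100, 100, LOCAL_COST, PROMOTE_COST)  # 2601
```

In this model the promoted configuration is cheaper only while steals stay infrequent, which mirrors the abstract's premise that the common case should pay the small-scope cost and the rare remote sharer should bear the extra cost.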
Synchronization Using Remote-Scope Promotion
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems