Abstract
Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and in a work-stealing queue, all of which have previously been used in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present, and prove sound, a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon.
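The intuition behind scoped synchronisation and remote-scope promotion can be sketched as a toy predicate. The following is a deliberately simplified illustration, not the axiomatic model or the Herd encoding described above: the names `Scope`, `AtomicOp`, `includes` and `may_synchronise` are hypothetical, and the rule that a remote-scope operation always bridges work-groups is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    WORK_GROUP = 1  # consistency only within the issuing thread's work-group
    DEVICE = 2      # consistency across every work-group on the device


@dataclass(frozen=True)
class AtomicOp:
    work_group: int       # hypothetical id of the issuing thread's work-group
    scope: Scope          # scope annotation on the operation
    remote: bool = False  # True for a remote-scope (RSP) operation


def includes(op: AtomicOp, other_work_group: int) -> bool:
    """Does op's scope cover a thread in other_work_group?"""
    return op.scope == Scope.DEVICE or op.work_group == other_work_group


def may_synchronise(release: AtomicOp, acquire: AtomicOp) -> bool:
    """Toy rule (an assumption, not the paper's definition): plain scoped
    operations must include each other's threads, while a remote-scope
    operation on either side promotes the pairing across work-groups."""
    if release.remote or acquire.remote:
        return True
    return (includes(release, acquire.work_group)
            and includes(acquire, release.work_group))


# Common case: the queue owner communicates within its own work-group.
owner_push = AtomicOp(work_group=0, scope=Scope.WORK_GROUP)
owner_pop = AtomicOp(work_group=0, scope=Scope.WORK_GROUP)

# Occasional case: a thief in another work-group steals work.
plain_thief = AtomicOp(work_group=1, scope=Scope.WORK_GROUP)
remote_thief = AtomicOp(work_group=1, scope=Scope.DEVICE, remote=True)
```

In this sketch the owner's operations stay at work-group scope, so the common case pays nothing extra; only the thief uses a remote-scope operation, which mirrors the work-stealing motivation given in the abstract.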
Supplemental Material
This archive contains (1) a virtual machine for replicating the results of simulating our litmus tests with Herd, and (2) our Isabelle formalisation of remote-scope promotion.
Remote-scope promotion: clarified, rectified, and verified
Published in OOPSLA 2015: Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications.