Remote-scope promotion: clarified, rectified, and verified

Published: 23 October 2015

Abstract

Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon.
