Abstract
Modern memory consistency models are complex, and it is difficult to reason about the relaxed behaviors that current systems allow. Programming languages such as C and OpenCL offer a memory model interface that developers can use to write concurrent applications safely. This abstraction provides functional portability across any platform that implements the interface, regardless of differences in the underlying systems. The guarantee hinges, however, on each system correctly implementing the interface. Many techniques for memory consistency model validation use empirical testing, which has been effective at uncovering undocumented behaviors and even at finding bugs in trusted compilation schemes. Memory model testing consists of small concurrent unit tests called “litmus tests”. In these tests, certain observations, including potential bugs, are exceedingly rare, as they may be triggered only by a precise interleaving of system steps in a complex processor, which is probabilistic in nature. Thus, each test must be run many times to provide a high level of confidence in its coverage.
In this work, we rigorously investigate empirical memory model testing. In particular, we propose methodologies for navigating complex stressing routines and for analyzing large numbers of testing observations. Using these insights, we can tune stressing parameters more efficiently, yielding higher-confidence results at a faster rate. We emphasize the need for such approaches through a meta-study of prior work, which reveals results with low reproducibility and inefficient use of testing time.
Our investigation is presented alongside empirical data. We believe that OpenCL targeting GPUs is a pragmatic choice in this domain, as there exists a variety of platforms to test, from large HPC servers to power-efficient edge devices. The tests presented in this work span 3 GPUs from 3 different vendors. We show that our methodologies are applicable across the GPUs, despite significant variance in the results. Concretely, our results show: lossless speedups of more than 5× in tuning using data peeking; a definition of portable stressing parameters that loses only 12% efficiency when generalized across our domain; and a priority order of litmus tests for tuning. We stress test a conformance test suite for the OpenCL 2.0 memory model and discover a bug in Intel’s compiler. Our methods are evaluated on the other two GPUs using mutation testing. We end with recommendations for official memory model conformance tests.
Foundations of empirical memory consistency testing