Abstract
To achieve good multi-core performance, modern microprocessors have weak memory models rather than enforcing sequential consistency. This gives the programmer wide latitude in choosing exactly how to implement various aspects of inter-thread communication through the system's shared memory. However, these choices come with both semantic and performance consequences, which are often in tension with each other. In this paper, we focus on the performance side and define techniques for evaluating the impact of various choices in using weak memory models, such as where to put fences and which fences to use. We make no attempt to judge certain strategies as best or most efficient; instead, we provide techniques that allow the programmer to understand the performance implications when identifying and resolving any semantic/performance trade-offs. In particular, our technique supports the reasoned selection of macrobenchmarks to use in investigating trade-offs in using weak memory models. We demonstrate our technique on both synthetic benchmarks and real-world applications for the Linux kernel and the OpenJDK HotSpot virtual machine on the ARMv8 and POWERv7 architectures.
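As a rough illustration of what evaluating the performance impact of a fencing choice can look like, the sketch below compares run times of the same benchmark built with two fencing strategies. It is not the paper's technique: the function names, the variant labels, and the timing samples are all hypothetical, and the samples are assumed to have been collected elsewhere.

```python
from statistics import mean, stdev


def relative_overhead(baseline_times, variant_times):
    """Fractional slowdown of the variant relative to the baseline.

    A result of 0.10 means the variant's mean run time is 10% higher.
    """
    base = mean(baseline_times)
    return (mean(variant_times) - base) / base


def summarize(times):
    """Return (mean, sample standard deviation) of a list of run times."""
    return mean(times), stdev(times)


# Hypothetical wall-clock times in seconds for one benchmark, measured with
# two fencing strategies. These numbers are made up for illustration only.
weaker_fence_runs = [10.0, 10.2, 9.8, 10.0]
stronger_fence_runs = [11.0, 11.2, 10.8, 11.0]

overhead = relative_overhead(weaker_fence_runs, stronger_fence_runs)
print(f"weaker fence:   mean={summarize(weaker_fence_runs)[0]:.2f}s")
print(f"stronger fence: mean={summarize(stronger_fence_runs)[0]:.2f}s")
print(f"relative overhead of stronger fence: {overhead:.1%}")
```

A comparison like this only says something about the benchmark that produced the samples. A benchmark that rarely executes fences will show little difference whichever strategy is used, which is why the choice of benchmark matters.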
Benchmarking weak memory models
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming