Abstract
Read-Modify-Write (RMW) instructions are widely used as the building blocks of a variety of higher-level synchronization constructs, including locks, barriers, and lock-free data structures. Unfortunately, they are expensive in architectures such as x86 and SPARC, which enforce (variants of) Total-Store-Order (TSO). A key reason is that RMWs in these architectures are ordered like a memory barrier, incurring the cost of a write-buffer drain in the critical path. Such strong ordering semantics are dictated by the requirements of the strict atomicity definition (type-1) that existing TSO RMWs use. Programmers often do not need such strong semantics. Moreover, weakening the atomicity definition of TSO RMWs would also weaken their ordering, thereby leading to more efficient hardware implementations.
In this paper we argue for TSO RMWs to use weaker atomicity definitions. We consider two weaker definitions, type-2 and type-3, which relax ordering to different degrees. We formally specify how such weaker RMWs would be ordered, and show that type-2 RMWs, in particular, can seamlessly replace existing type-1 RMWs in common synchronization idioms, except in situations where a type-1 RMW is used as a memory barrier. Recent work has shown that the new C/C++11 concurrency model can be realized by generating conventional (type-1) RMWs for C/C++11 SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid using the proposed type-2 RMWs; type-3 RMWs, on the other hand, could be used for SC-atomic-reads (and optionally SC-atomic-writes). We further propose efficient microarchitectural implementations for type-2 (type-3) RMWs. Simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) on a set of parallel programs, including those from the SPLASH-2, PARSEC, and STAMP benchmarks.
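To make the two uses of RMWs mentioned above concrete, the following is a minimal C++11 sketch, not code from the paper. The names `SpinLock`, `try_enter`, and `single_threaded_demo` are hypothetical. The lock relies on an RMW (`exchange`) only for its atomicity, while `try_enter` shows an SC-atomic-write followed by an SC-atomic-read of a different location. It is an assumption about typical x86 compilers, not something the abstract states, that the SC store here is emitted as an `xchg` instruction.

```cpp
#include <atomic>
#include <cassert>

// Test-and-set spinlock: the RMW (exchange) is used for its atomicity,
// i.e. to read the old value and write `true` as one indivisible step.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        return !locked.exchange(true, std::memory_order_acquire);
    }
    void lock() {
        while (!try_lock()) { /* spin until the holder releases */ }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Dekker-style entry protocol: an SC-atomic-write to the caller's own flag,
// then an SC-atomic-read of the other party's flag. The write must be
// ordered before the read, which is why SC atomics are requested here.
bool try_enter(std::atomic<int>& mine, std::atomic<int>& other) {
    mine.store(1, std::memory_order_seq_cst);            // SC-atomic-write
    return other.load(std::memory_order_seq_cst) == 0;   // SC-atomic-read
}

// Single-threaded walk-through of both idioms; it checks the sequential
// behaviour only and does not exercise any concurrent interleaving.
bool single_threaded_demo() {
    SpinLock l;
    assert(l.try_lock());    // lock was free
    assert(!l.try_lock());   // already held
    l.unlock();

    std::atomic<int> a{0}, b{0};
    bool first = try_enter(a, b);    // b is still 0
    bool second = try_enter(b, a);   // a is already 1
    return first && !second;
}
```

In a real program `lock()` and `try_enter()` would be called from different threads; the demo function only illustrates the calling pattern.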
Fast RMWs for TSO: semantics and implementation
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation






