Abstract
There have been several recent efforts to improve the performance of fences. The most aggressive designs allow post-fence accesses to retire and complete before the fence completes. Unfortunately, such designs present implementation difficulties due to their reliance on global state and structures.
This paper's goal is to optimize both the performance and the implementability of fences. We start off with a design like the most aggressive ones but without the global state. We call it Weak Fence or wF. Since the concurrent execution of multiple wFs can deadlock, we combine wFs with a conventional fence (i.e., Strong Fence or sF) for the less performance-critical thread(s). We call the result an Asymmetric fence group. We also propose a taxonomy of Asymmetric fence groups under TSO. Compared to past aggressive fences, Asymmetric fence groups are both substantially easier to implement and have higher average performance. The two main designs presented (WS+ and W+) speed up workloads under TSO by an average of 13% and 21%, respectively, over conventional fences.
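To make the setting concrete, the following is a minimal sketch of the kind of code in which fences matter under TSO: two threads each write one location, execute a fence, and then read the other location. It is an illustration only, not code from the paper. The function names (`both_read_zero_once`, `count_violations`) are hypothetical, and both threads use the conventional full fence available in standard C++ (`std::atomic_thread_fence`), since wF is a hardware proposal with no standard C++ equivalent. In an Asymmetric fence group, the less performance-critical thread would keep a conventional fence of this kind (sF), while the other thread would use a wF.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One run of the two-thread pattern: each thread stores to its own flag,
// fences, then loads the other thread's flag. Returns true if BOTH threads
// read 0, which is the outcome the fences are meant to rule out.
bool both_read_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r0 = -1, r1 = -1;

    std::thread t0([&] {
        x.store(1, std::memory_order_relaxed);
        // Conventional full fence (plays the role of sF in this sketch):
        // the store above is ordered before the load below.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r0 = y.load(std::memory_order_relaxed);
    });
    std::thread t1([&] {
        y.store(1, std::memory_order_relaxed);
        // Also a conventional fence here; a wF would go in the
        // performance-critical thread on hardware that supported it.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_relaxed);
    });

    t0.join();
    t1.join();
    return r0 == 0 && r1 == 0;
}

// Repeats the pattern and counts how often both threads read 0.
int count_violations(int iterations) {
    assert(iterations >= 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        if (both_read_zero_once()) {
            ++violations;
        }
    }
    return violations;
}
```

With both fences in place, `count_violations` should return 0 for any iteration count. Removing either fence allows the store to be delayed past the following load, so the both-zero outcome becomes possible on a TSO machine.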
Asymmetric Memory Fences: Optimizing Both Performance and Implementability
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems