Abstract
Over the last decade, the growing use of cache-coherent NUMA architectures has spurred the development of numerous locality-preserving mutual exclusion algorithms. NUMA-aware locks such as HCLH, HMCS, and cohort locks exploit locality of reference among nearby threads to deliver high lock throughput under high contention. However, the hierarchical nature of these locality-aware locks increases acquisition latency, which reduces the throughput of uncontended or lightly contended critical sections. To date, no lock design for NUMA systems has delivered both low latency under low contention and high throughput under high contention.
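HMCS, named above, is built from the MCS queue lock of Mellor-Crummey and Scott, instantiated at each level of the NUMA hierarchy. As background only, here is a minimal C11 sketch of a single-level MCS lock, not of HMCS or AHMCS: each waiter enqueues its own node with one atomic exchange and then spins only on that node, so a contended handoff touches just the successor's cache line. All names (mcs_lock, mcs_acquire, run_counter_demo, and so on) are illustrative, and the sched_yield() calls are an assumption added so the demo still makes progress when threads outnumber cores.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per acquiring thread; a waiter spins only on its own node. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

/* The lock is a pointer to the tail of the waiter queue (NULL means free). */
typedef struct {
    _Atomic(mcs_node *) tail;
} mcs_lock;

static void mcs_acquire(mcs_lock *lock, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Enqueue with one atomic swap; the previous tail is our predecessor. */
    mcs_node *pred = atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
    if (pred != NULL) {
        /* Link behind the predecessor, then spin locally until it hands off. */
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            sched_yield(); /* assumption: yield instead of pure spinning */
    }
}

static void mcs_release(mcs_lock *lock, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to reset the tail to "free". */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor swapped in but has not linked yet; wait for it. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            sched_yield();
    }
    /* Hand the lock directly to the successor. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}

/* Hypothetical demo: nthreads workers each increment a shared counter. */
typedef struct { mcs_lock *lock; long *counter; int iters; } demo_args;

static void *demo_worker(void *p) {
    demo_args *a = p;
    mcs_node node; /* per-thread node, safely reused after each release */
    for (int i = 0; i < a->iters; i++) {
        mcs_acquire(a->lock, &node);
        (*a->counter)++; /* protected by the lock */
        mcs_release(a->lock, &node);
    }
    return NULL;
}

static long run_counter_demo(int nthreads, int iters) {
    mcs_lock lock;
    atomic_init(&lock.tail, NULL);
    long counter = 0;
    demo_args args = { &lock, &counter, iters };
    pthread_t tids[64];
    assert(nthreads > 0 && nthreads <= 64);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, demo_worker, &args);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

With no contention, an acquisition here costs one exchange and a release one compare-and-swap. A multi-level hierarchical lock repeats the enqueue at each level it must climb, which is one way to see the extra latency described above.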
In this paper, we describe the design and evaluation of an adaptive mutual exclusion scheme, the AHMCS lock, which employs several orthogonal strategies: a hierarchical MCS (HMCS) lock for high throughput under high contention, Lamport's fast-path approach for low latency under low contention, an adaptation mechanism that employs hysteresis to balance latency and throughput under moderate contention, and hardware transactional memory for the lowest latency in the absence of contention. The result is a top-performing lock that has most of the properties of an ideal mutual exclusion algorithm. AHMCS exploits the strengths of multiple contention management techniques to deliver high performance over a broad range of contention levels. Our empirical evaluations demonstrate the effectiveness of AHMCS over prior art.
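The fast-path approach credited to Lamport above comes from his fast mutual exclusion algorithm (TOCS 1987), in which a thread that meets no contention enters the critical section after a constant number of shared reads and writes, independent of the number of threads. The following is a minimal C11 sketch of that classic algorithm, not of how AHMCS combines a fast path with HMCS, which the abstract does not detail. The names fast_lock, fast_unlock, MAX_PROCS, and run_lamport_demo are illustrative; every shared access uses the default sequentially consistent atomics because the algorithm assumes sequentially consistent memory, and sched_yield() is an added assumption for oversubscribed machines.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_PROCS 8 /* hypothetical bound; thread ids run 1..MAX_PROCS */

/* Shared state: 0 means "no thread". Default atomics are seq_cst. */
static atomic_int x;
static atomic_int y;
static atomic_bool b[MAX_PROCS + 1];

static void fast_lock(int i) {
    for (;;) { /* each iteration is Lamport's "start" label */
        atomic_store(&b[i], true);
        atomic_store(&x, i);
        if (atomic_load(&y) != 0) { /* another thread is ahead of us */
            atomic_store(&b[i], false);
            while (atomic_load(&y) != 0)
                sched_yield();
            continue;
        }
        atomic_store(&y, i);
        if (atomic_load(&x) == i)
            return; /* fast path: no competitor overwrote x */
        /* Slow path: wait until every contender has withdrawn or settled. */
        atomic_store(&b[i], false);
        for (int j = 1; j <= MAX_PROCS; j++)
            while (atomic_load(&b[j]))
                sched_yield();
        if (atomic_load(&y) == i)
            return; /* we still own y, so we may enter */
        while (atomic_load(&y) != 0)
            sched_yield();
    }
}

static void fast_unlock(int i) {
    atomic_store(&y, 0);
    atomic_store(&b[i], false);
}

/* Hypothetical demo: each thread uses a distinct id in 1..nthreads. */
typedef struct { int id; int iters; long *counter; } lamport_args;

static void *lamport_worker(void *p) {
    lamport_args *a = p;
    for (int k = 0; k < a->iters; k++) {
        fast_lock(a->id);
        (*a->counter)++; /* protected by the lock */
        fast_unlock(a->id);
    }
    return NULL;
}

static long run_lamport_demo(int nthreads, int iters) {
    pthread_t tids[MAX_PROCS];
    lamport_args args[MAX_PROCS];
    long counter = 0;
    assert(nthreads > 0 && nthreads <= MAX_PROCS);
    for (int t = 0; t < nthreads; t++) {
        args[t] = (lamport_args){ t + 1, iters, &counter };
        int rc = pthread_create(&tids[t], NULL, lamport_worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    return counter;
}
```

On the fast path, acquire plus release costs five shared writes and two shared reads regardless of MAX_PROCS; only a thread that finds x overwritten pays for the O(N) scan of b[]. The algorithm is deadlock-free but does not guarantee freedom from starvation.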
References
- L. Adhianto et al. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. Concurrency and Computation: Practice and Experience, 22(6):685–701, Apr. 2010.
- S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. Non-scalable Locks are Dangerous. In Proc. Linux Symposium, 2012.
- M. Chabbi. Software Support For Efficient Use of Modern Computer Architectures. PhD thesis, The Department of Computer Science, Rice University, Houston, Texas, USA, Aug. 2015.
- M. Chabbi et al. High Performance Locks for Multi-level NUMA Systems. In Proc. of the 20th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 215–226, 2015.
- D. Dice et al. Applications of the Adaptive Transactional Memory Test Platform. In 3rd ACM SIGPLAN Workshop on Transactional Computing, pages 1–10, 2008.
- D. Dice et al. Flat-combining NUMA Locks. In Proc. of the 23rd Annual ACM Symp. on Parallelism in Algorithms and Architectures, SPAA '11, pages 65–74, 2011.
- D. Dice et al. Lock Cohorting: A General Technique for Designing NUMA Locks. In Proc. of the 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 247–256, 2012.
- D. Dice et al. Lightweight Contention Management for Efficient Compare-and-swap Operations. In Proc. of the 19th Intl. Conf. on Parallel Processing, Euro-Par'13, pages 595–606, 2013.
- F. Ellen et al. SNZI: Scalable NonZero Indicators. In Proc. of the 26th Annual ACM Symp. on Principles of Distributed Computing, PODC '07, pages 13–22, 2007.
- P. Fatourou and N. D. Kallimanis. Revisiting the Combining Synchronization Technique. In Proc. of the 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 257–266, 2012.
- J. R. Goodman et al. Efficient Synchronization Primitives for Large-scale Cache-coherent Multiprocessors. In Proc. of the Third Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS III, pages 64–75, 1989.
- R. Guerraoui et al. Toward a Theory of Transactional Contention Managers. In Proc. of the 24th Annual ACM Symp. on Principles of Distributed Computing, PODC '05, pages 258–264, 2005.
- M. Herlihy et al. Software Transactional Memory for Dynamic-sized Data Structures. In Proc. of the 22nd Annual Symp. on Principles of Distributed Computing, PODC '03, pages 92–101, 2003.
- Hewlett Packard Enterprise. HP Integrity Superdome X. http://www8.hp.com/h20195/v2/GetPDF.aspx/c04383189.pdf.
- Intel Corp. An Introduction to the Intel® QuickPath Interconnect. http://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html, 2009.
- R. Johnson et al. Improving OLTP Scalability Using Speculative Lock Inheritance. Proc. VLDB Endow., 2(1):479–489, Aug. 2009.
- C. Kim et al. Nonuniform Cache Architectures for Wire-delay Dominated On-chip Caches. IEEE Micro, 23(6):99–107, Nov. 2003.
- A. Kleen. Lock Elision in the GNU C Library. https://lwn.net/Articles/534758/, 2013.
- L. Lamport. A Fast Mutual Exclusion Algorithm. ACM Transactions on Computer Systems (TOCS), 5(1):1–11, Jan. 1987.
- P.-A. Larson et al. High-performance Concurrency Control Mechanisms for Main-memory Databases. Proc. VLDB Endow., 5(4):298–309, Dec. 2011.
- B.-H. Lim and A. Agarwal. Reactive Synchronization Algorithms for Multiprocessors. In Proc. of the 6th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS VI, pages 25–35, 1994.
- J.-P. Lozi. Towards More Scalable Mutual Exclusion for Multicore Architectures. PhD thesis, Université Pierre et Marie Curie, Paris, France, 2014. http://rclrepository.gforge.inria.fr/rcl_benchmarks_23-11-15.tar.gz.
- J.-P. Lozi et al. Remote Core Locking: Migrating Critical-section Execution to Improve the Performance of Multithreaded Applications. In Proc. of the 2012 USENIX Annual Technical Conf., USENIX ATC'12, pages 6–6, 2012.
- V. Luchangco et al. A Hierarchical CLH Queue Lock. In Proc. of the 12th Intl. Conf. on Parallel Processing, Euro-Par'06, pages 801–810, 2006.
- P. S. Magnusson et al. Queue Locks on Cache Coherent Multiprocessors. In Proc. of the 8th Intl. Symp. on Parallel Processing, pages 165–171, 1994.
- J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, Feb. 1991.
- T. Nakaike et al. Quantitative Comparison of Hardware Transactional Memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. In Proc. of the 42nd Annual Intl. Symp. on Computer Architecture, ISCA '15, pages 144–157, 2015.
- R. Narayanan et al. MineBench: A Benchmark Suite for Data Mining Workloads. In Proc. of the IEEE Intl. Symp. on Workload Characterization, pages 182–188, 2006.
- I. Pandis et al. Data-oriented Transaction Execution. Proc. VLDB Endow., 3(1-2):928–939, Sept. 2010.
- K. K. Pusukuri et al. Shuffling: A Framework for Lock Contention Aware Thread Scheduling for Multicore Multiprocessor Systems. In Proc. of the 23rd Intl. Conf. on Parallel Architectures and Compilation, PACT '14, pages 289–300, 2014.
- Z. Radovic and E. Hagersten. Hierarchical Backoff Locks for Nonuniform Communication Architectures. In Proc. of the 9th Intl. Symp. on High-Performance Computer Architecture, HPCA '03, 2003.
- R. Rajwar and J. R. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. In Proc. of the 34th Annual ACM/IEEE Intl. Symp. on Microarchitecture, MICRO 34, pages 294–305, 2001.
- W. N. Scherer, III and M. L. Scott. Advanced Contention Management for Dynamic Software Transactional Memory. In Proc. of the 24th Annual ACM Symp. on Principles of Distributed Computing, PODC '05, pages 240–248, 2005.
- SGI. SGI Altix UV 1000 System User's Guide. http://techpubs.sgi.com/library/manuals/5000/007-5663-003/pdf/007-5663-003.pdf.
- N. Shavit and A. Zemach. Combining Funnels. J. Parallel Distrib. Comput., 60(11):1355–1387, Nov. 2000.
- D. D. Sleator and R. E. Tarjan. Self-adjusting Binary Search Trees. Journal of the ACM, 32(3):652–686, July 1985.
- M. F. Spear et al. A Comprehensive Strategy for Contention Management in Software Transactional Memory. In Proc. of the 14th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 141–150, 2009.
- T. Usui et al. Adaptive Locks: Combining Transactions and Locks for Efficient Concurrency. In Proc. of the 2009 18th Intl. Conf. on Parallel Architectures and Compilation Techniques, PACT '09, pages 3–14, Washington, DC, USA, 2009. IEEE Computer Society.
- J.-H. Yang and J. Anderson. A Fast, Scalable Mutual Exclusion Algorithm. Distributed Computing, 9(1):51–60, 1995.
- X. Yu et al. On Adaptive Contention Management Strategies for Software Transactional Memory. In 2012 IEEE 10th Intl. Symp. on Parallel and Distributed Processing with Applications (ISPA), pages 24–31, July 2012.
Contention-conscious, locality-preserving locks
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming