Abstract
In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly. Contending writes get serialized, creating a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem---at best, they help to efficiently serialize the contending writes.
This paper proposes a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called CASPAR, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly-scalable synchronization. We evaluate CASPAR with simulations of a 64-core chip. Compared to existing proposals with hardware queues, CASPAR improves the throughput of kernels by 32% on average, and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.
- The Go Programming Language. http://golang.org, 2014.Google Scholar
- ]LFapp2NetBSD producer/consumer queue. ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/sys/kern/subr_pcq.c, 2014.Google Scholar
- 8: High Performance C+ FIX Framework. http://fix8.org, 2014.Google Scholar
- MySQL Concurrent Allocator. https://github.com/twitter/mysql/blob/master/mysys/lf_alloc-pin.c, 2014.Google Scholar
- Y. Afek, G. Korland, and E. Yanovsky. Quasi-Linearizability: Relaxed Consistency for Improved Concurrency. In phProceedings of the 14th International Conference On Principles Of Distributed Systems (OPODIS 2010), volume 6490 of LNCS, pages 395--410. 2010.Google Scholar
- D. Alistarh, J. Kopinsky, J. Li, and N. Shavit. The SprayList: A Scalable Relaxed Priority Queue. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 11--20, 2015.Google Scholar
Digital Library
- R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. In Proceedings of the 4th International Conference on Supercomputing, ICS '90, pages 1--6, 1990.Google Scholar
Digital Library
- U. Aydonat and T. S. Abdelrahman. Hardware Support for Relaxed Concurrency Control in Transactional Memory. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 15--26, 2010.Google Scholar
Digital Library
- S. Bansal and D. S. Modha. CAR: Clock with Adaptive Replacement. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST '04, pages 187--200, 2004.Google Scholar
- G. Blake, R. G. Dreslinski, and T. Mudge. Proactive Transaction Scheduling for Contention Management. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '09, pages 156--167, 2009.Google Scholar
Digital Library
- C. Blundell, A. Raghavan, and M. M. Martin. RETCON: Transactional Repair Without Replay. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 258--269, 2010.Google Scholar
Digital Library
- S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report MIT-CSAIL-TR-2014-019, MIT Computer Science and Artificial Intelligence Laboratory, September 2014.Google Scholar
- J. D. Brouer. Qdisc lockless FIFO. In phNetfilter Workshop, 2014.Google Scholar
- T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-core Simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 52:1--52:12, 2011.Google Scholar
Digital Library
- C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1243--1254, 2013.Google Scholar
Digital Library
- N. Diegues, P. Romano, and L. Rodrigues. Virtues and Limitations of Commodity Hardware Transactional Memory. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 3--14, 2014.Google Scholar
Digital Library
- A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, 2009.Google Scholar
Digital Library
- K. Fraser. Practical lock-freedom. PhD thesis, University of Cambridge, Computer Laboratory, University of Cambridge, Computer Laboratory, February 2004.Google Scholar
- L. Gidra, G. Thomas, J. Sopena, and M. Shapiro. A Study of the Scalability of Stop-the-world Garbage Collectors on Multicores. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 229--240, 2013.Google Scholar
Digital Library
- J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient Synchronization Primitives for Large-scale Cache-coherent Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '89, pages 64--75, 1989.Google Scholar
Digital Library
- A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- Designing an MIMD, Shared Memory Parallel Machine. In IEEE Transactions on Computers, February 1983.Google Scholar
- A. Haas, M. Lippautz, T. A. Henzinger, H. Payer, A. Sokolova, C. M. Kirsch, and A. Sezgin. Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation. In Proceedings of the ACM International Conference on Computing Frontiers, CF '13, pages 17:1--17:9, 2013.Google Scholar
Digital Library
- D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2004, pages 206--215, 2004.Google Scholar
Digital Library
- M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13: 124--149, January 1991.Google Scholar
- M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.Google Scholar
Digital Library
- M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In Proceedings of the 11th International Conference on Principles of Distributed Systems, OPODIS'07, pages 401--414, 2007.Google Scholar
Digital Library
- Intel. Intel Xeon Phi Coprocessor. https://software.intel.com/en-us/mic-developer, 2013.Google Scholar
- S. A. R. Jafri, G. Voskuilen, and T. N. Vijaykumar. Wait-n-GoTM: Improving HTM Performance by Serializing Cyclic Dependencies. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 521--534, 2013.Google Scholar
Digital Library
- E. H. Jensen, G. W. Hagensen, and J. M. Broughton. A New Approach to Exclusive Data Access in Shared Memory Multiprocessors. Technical Report UCRL-97663, Lawrence Livermore National Laboratory, 1987.Google Scholar
- R. Jones, A. Hosking, and E. Moss. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman & Hall/CRC, 1st edition, 2011.Google Scholar
- H. Jordan. Performance Measurements on HEP A Pipelined MIMD Computer. In International Symposium on Computer Architecture (ISCA), June 1983.Google Scholar
- A. Kagi, D. Burger, and J. R. Goodman. Efficient Synchronization: Let Them Eat QOLB. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 170--180, 1997.Google Scholar
Digital Library
- C. Kirsch, H. Payer, H. Röck, and A. Sokolova. Performance, Scalability, and Semantics of Concurrent FIFO Queues. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '12), volume 7439 of phLNCS, pages 273--287. 2012.Google Scholar
Digital Library
- C. M. Kirsch, M. Lippautz, and H. Payer. Fast and Scalable, Lock-Free k-FIFO Queues. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 241--252, 2013.Google Scholar
Digital Library
- A. Kivity, D. Laor, G. Costa, P. Enberg, N. Har'El, D. Marti, and V. Zolotarov. OSv-Optimizing the Operating System for Virtual Machines. In Proceedings of the 2014 USENIX Annual Technical Conference, ATC '14, pages 61--72, June 2014.Google Scholar
- A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 223--234, 2011.Google Scholar
Digital Library
- D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data Forwarding in Scalable Shared-memory Multiprocessors. In Proceedings of the 9th International Conference on Supercomputing, ICS '95, pages 255--264, 1995.Google Scholar
Digital Library
- D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C. Q. Zhu, A. Veidenbaum, J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley, U. M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jaxon, Z. Li, T. Murphy, and J. Andrews. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, pages 213--223, 1993.Google Scholar
Digital Library
- M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 211--222, 2007.Google Scholar
Digital Library
- E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In Proceedings of the 18th International Symposium on Distributed Computing (DISC 2004), volume 3274 of phLNCS, pages 117--131, 2004.Google Scholar
Cross Ref
- P.-A. Larson and M. Krishnan. Memory Allocation for Long-running Server Applications. In Proceedings of the 1st International Symposium on Memory Management, ISMM '98, pages 176--185, 1998.Google Scholar
Digital Library
- J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 241--251, 1997.Google Scholar
Digital Library
- D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. In IEEE Computer, March 1992.Google Scholar
Digital Library
- I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. In Proceedings of the 14th International Parallel and Distributed Processing Symposium, IPDPS '00, pages 263--268, 2000.Google Scholar
- N. D. Matsakis and F. S. Klock, II. The Rust Language. In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology, HILT '14, pages 103--104, 2014.Google Scholar
Digital Library
- M. M. Michael. Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Transactions on Parallel and Distributed System, 15 (6): 491--504, June 2004.Google Scholar
Digital Library
- M. M. Michael. Scalable lock-free dynamic memory allocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '04, pages 35--46, 2004.Google Scholar
Digital Library
- M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, PODC '96, pages 267--275, 1996.Google Scholar
Digital Library
- M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51: 1--26, May 1998.Google Scholar
Digital Library
- D. Nguyen, A. Lenharth, and K. Pingali. A Lightweight Infrastructure for Graph Analytics. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP '13, pages 456--471, 2013.Google Scholar
Digital Library
- S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins. OpenMP Task Scheduling Strategies for Multicore NUMA Systems. International Journal of High Performance Computing Applications, 26 (2): 110--124, May 2012.Google Scholar
Digital Library
- G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. In Proceedings of the 1985 International Conference on Parallel Processing, ICPP '85, pages 764--771, 1985.Google Scholar
- X. Qian, B. Sahelices, and J. Torrellas. BulkSMT: Designing SMT Processors for Atomic-block Execution. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA '12, pages 1--12, 2012.Google Scholar
Digital Library
- X. Qian, B. Sahelices, and J. Torrellas. OmniOrder: Directory-based Conflict Serialization of Transactions. In Proceeding of the 41st Annual International Symposium on Computer Architecuture, ISCA '14, pages 421--432, 2014.Google Scholar
Digital Library
- R. Rajwar, A. Kagi, and J. R. Goodman. Improving the throughput of synchronization by insertion of delays. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, HPCA '00, pages 168--179, January 2000.Google Scholar
- R. Rajwar, A. Kagi, and J. R. Goodman. Inferential Queueing and Speculative Push for Reducing Critical Communication Latencies. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 273--284, 2003.Google Scholar
Digital Library
- H. E. Ramadan, C. J. Rossbach, and E. Witchel. Dependence-aware Transactional Memory for Increased Concurrency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '08, pages 246--257, 2008.Google Scholar
Cross Ref
- A. Ros and S. Kaxiras. Complexity-effective Multicore Coherence. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 241--252, 2012.Google Scholar
Digital Library
- S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable Locality-conscious Multithreaded Memory Allocation. In Proceedings of the 5th International Symposium on Memory Management, ISMM '06, pages 84--94, 2006.Google Scholar
Digital Library
- S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '96, pages 26--36, 1996.Google Scholar
- S. Seo, J. Kim, and J. Lee. SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, PACT '11, pages 253--263, 2011.Google Scholar
Digital Library
- N. Shavit. Data Structures in the Multicore Age. Communications of the ACM, 54 (3): 76--84, Mar. 2011.Google Scholar
Digital Library
- M. Stonebraker, A. Pavlo, R. Taft, and M. L. Brodie. Enterprise Database Applications and the Cloud: A Difficult Road Ahead. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E '14, pages 1--6, 2014.Google Scholar
Digital Library
- M. E. Thomadakis. The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms. Resource, 3: 2, 2011.Google Scholar
- P. Trancoso and J. Torrellas. The Impact of Speeding Up Critical Sections with Data Prefetching and Forwarding. In Proceedings of the 1996 International Conference on Parallel Processing, ICPP '96, pages 79--86.Google Scholar
- R. K. Treiber. Systems Programming: Coping with Parallelism. Technical Report RJ5118, IBM Almaden Research Center, 2006.Google Scholar
- L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33 (8): 103--111, Aug. 1990.Google Scholar
Digital Library
- E. Vallejo, R. Beivide, A. Cristal, T. Harris, F. Vallejo, O. Unsal, and M. Valero. Architectural Support for Fair Reader-Writer Locking. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 275--286, 2010.Google Scholar
Digital Library
- K. B. Wheeler, R. C. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the 22nd International Parallel and Distributed Processing Symposium, IPDPS '08, pages 1--8, 2008.Google Scholar
Cross Ref
Index Terms
CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
Recommendations
CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
ASPLOS'16In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade ...
CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating SystemsIn multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade ...
Lock-free and scalable multi-version software transactional memory
PPoPP '11Software Transactional Memory (STM) was initially proposed as a lock-free mechanism for concurrency control. Early implementations had efficiency limitations, and soon obstruction-free proposals appeared, to tackle this problem, often simplifying STM ...







Comments