skip to main content
research-article
Public Access

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

Authors Info & Claims
Published:25 March 2016Publication History
Skip Abstract Section

Abstract

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly. Contending writes get serialized, creating a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem---at best, they help to efficiently serialize the contending writes.

This paper proposes a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called CASPAR, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly-scalable synchronization. We evaluate CASPAR with simulations of a 64-core chip. Compared to existing proposals with hardware queues, CASPAR improves the throughput of kernels by 32% on average, and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.

References

  1. The Go Programming Language. http://golang.org, 2014.Google ScholarGoogle Scholar
  2. ]LFapp2NetBSD producer/consumer queue. ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/sys/kern/subr_pcq.c, 2014.Google ScholarGoogle Scholar
  3. 8: High Performance C+ FIX Framework. http://fix8.org, 2014.Google ScholarGoogle Scholar
  4. MySQL Concurrent Allocator. https://github.com/twitter/mysql/blob/master/mysys/lf_alloc-pin.c, 2014.Google ScholarGoogle Scholar
  5. Y. Afek, G. Korland, and E. Yanovsky. Quasi-Linearizability: Relaxed Consistency for Improved Concurrency. In phProceedings of the 14th International Conference On Principles Of Distributed Systems (OPODIS 2010), volume 6490 of LNCS, pages 395--410. 2010.Google ScholarGoogle Scholar
  6. D. Alistarh, J. Kopinsky, J. Li, and N. Shavit. The SprayList: A Scalable Relaxed Priority Queue. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 11--20, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. In Proceedings of the 4th International Conference on Supercomputing, ICS '90, pages 1--6, 1990.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. U. Aydonat and T. S. Abdelrahman. Hardware Support for Relaxed Concurrency Control in Transactional Memory. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 15--26, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Bansal and D. S. Modha. CAR: Clock with Adaptive Replacement. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST '04, pages 187--200, 2004.Google ScholarGoogle Scholar
  10. G. Blake, R. G. Dreslinski, and T. Mudge. Proactive Transaction Scheduling for Contention Management. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '09, pages 156--167, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. C. Blundell, A. Raghavan, and M. M. Martin. RETCON: Transactional Repair Without Replay. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 258--269, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report MIT-CSAIL-TR-2014-019, MIT Computer Science and Artificial Intelligence Laboratory, September 2014.Google ScholarGoogle Scholar
  13. J. D. Brouer. Qdisc lockless FIFO. In phNetfilter Workshop, 2014.Google ScholarGoogle Scholar
  14. T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-core Simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 52:1--52:12, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1243--1254, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. N. Diegues, P. Romano, and L. Rodrigues. Virtues and Limitations of Commodity Hardware Transactional Memory. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 3--14, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. K. Fraser. Practical lock-freedom. PhD thesis, University of Cambridge, Computer Laboratory, University of Cambridge, Computer Laboratory, February 2004.Google ScholarGoogle Scholar
  19. L. Gidra, G. Thomas, J. Sopena, and M. Shapiro. A Study of the Scalability of Stop-the-world Garbage Collectors on Multicores. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 229--240, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient Synchronization Primitives for Large-scale Cache-coherent Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '89, pages 64--75, 1989.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- Designing an MIMD, Shared Memory Parallel Machine. In IEEE Transactions on Computers, February 1983.Google ScholarGoogle Scholar
  22. A. Haas, M. Lippautz, T. A. Henzinger, H. Payer, A. Sokolova, C. M. Kirsch, and A. Sezgin. Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation. In Proceedings of the ACM International Conference on Computing Frontiers, CF '13, pages 17:1--17:9, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2004, pages 206--215, 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13: 124--149, January 1991.Google ScholarGoogle Scholar
  25. M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In Proceedings of the 11th International Conference on Principles of Distributed Systems, OPODIS'07, pages 401--414, 2007.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Intel. Intel Xeon Phi Coprocessor. https://software.intel.com/en-us/mic-developer, 2013.Google ScholarGoogle Scholar
  28. S. A. R. Jafri, G. Voskuilen, and T. N. Vijaykumar. Wait-n-GoTM: Improving HTM Performance by Serializing Cyclic Dependencies. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 521--534, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. E. H. Jensen, G. W. Hagensen, and J. M. Broughton. A New Approach to Exclusive Data Access in Shared Memory Multiprocessors. Technical Report UCRL-97663, Lawrence Livermore National Laboratory, 1987.Google ScholarGoogle Scholar
  30. R. Jones, A. Hosking, and E. Moss. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman & Hall/CRC, 1st edition, 2011.Google ScholarGoogle Scholar
  31. H. Jordan. Performance Measurements on HEP A Pipelined MIMD Computer. In International Symposium on Computer Architecture (ISCA), June 1983.Google ScholarGoogle Scholar
  32. A. Kagi, D. Burger, and J. R. Goodman. Efficient Synchronization: Let Them Eat QOLB. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 170--180, 1997.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. C. Kirsch, H. Payer, H. Röck, and A. Sokolova. Performance, Scalability, and Semantics of Concurrent FIFO Queues. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '12), volume 7439 of phLNCS, pages 273--287. 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. C. M. Kirsch, M. Lippautz, and H. Payer. Fast and Scalable, Lock-Free k-FIFO Queues. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 241--252, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. A. Kivity, D. Laor, G. Costa, P. Enberg, N. Har'El, D. Marti, and V. Zolotarov. OSv-Optimizing the Operating System for Virtual Machines. In Proceedings of the 2014 USENIX Annual Technical Conference, ATC '14, pages 61--72, June 2014.Google ScholarGoogle Scholar
  36. A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 223--234, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data Forwarding in Scalable Shared-memory Multiprocessors. In Proceedings of the 9th International Conference on Supercomputing, ICS '95, pages 255--264, 1995.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C. Q. Zhu, A. Veidenbaum, J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley, U. M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jaxon, Z. Li, T. Murphy, and J. Andrews. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, pages 213--223, 1993.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 211--222, 2007.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In Proceedings of the 18th International Symposium on Distributed Computing (DISC 2004), volume 3274 of phLNCS, pages 117--131, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  41. P.-A. Larson and M. Krishnan. Memory Allocation for Long-running Server Applications. In Proceedings of the 1st International Symposium on Memory Management, ISMM '98, pages 176--185, 1998.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 241--251, 1997.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. In IEEE Computer, March 1992.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. In Proceedings of the 14th International Parallel and Distributed Processing Symposium, IPDPS '00, pages 263--268, 2000.Google ScholarGoogle Scholar
  45. N. D. Matsakis and F. S. Klock, II. The Rust Language. In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology, HILT '14, pages 103--104, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. M. M. Michael. Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Transactions on Parallel and Distributed System, 15 (6): 491--504, June 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. M. M. Michael. Scalable lock-free dynamic memory allocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '04, pages 35--46, 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, PODC '96, pages 267--275, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51: 1--26, May 1998.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. D. Nguyen, A. Lenharth, and K. Pingali. A Lightweight Infrastructure for Graph Analytics. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP '13, pages 456--471, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins. OpenMP Task Scheduling Strategies for Multicore NUMA Systems. International Journal of High Performance Computing Applications, 26 (2): 110--124, May 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. In Proceedings of the 1985 International Conference on Parallel Processing, ICPP '85, pages 764--771, 1985.Google ScholarGoogle Scholar
  53. X. Qian, B. Sahelices, and J. Torrellas. BulkSMT: Designing SMT Processors for Atomic-block Execution. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA '12, pages 1--12, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. X. Qian, B. Sahelices, and J. Torrellas. OmniOrder: Directory-based Conflict Serialization of Transactions. In Proceeding of the 41st Annual International Symposium on Computer Architecuture, ISCA '14, pages 421--432, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. R. Rajwar, A. Kagi, and J. R. Goodman. Improving the throughput of synchronization by insertion of delays. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, HPCA '00, pages 168--179, January 2000.Google ScholarGoogle Scholar
  56. R. Rajwar, A. Kagi, and J. R. Goodman. Inferential Queueing and Speculative Push for Reducing Critical Communication Latencies. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 273--284, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. H. E. Ramadan, C. J. Rossbach, and E. Witchel. Dependence-aware Transactional Memory for Increased Concurrency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '08, pages 246--257, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  58. A. Ros and S. Kaxiras. Complexity-effective Multicore Coherence. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 241--252, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable Locality-conscious Multithreaded Memory Allocation. In Proceedings of the 5th International Symposium on Memory Management, ISMM '06, pages 84--94, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '96, pages 26--36, 1996.Google ScholarGoogle Scholar
  61. S. Seo, J. Kim, and J. Lee. SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, PACT '11, pages 253--263, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. N. Shavit. Data Structures in the Multicore Age. Communications of the ACM, 54 (3): 76--84, Mar. 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. M. Stonebraker, A. Pavlo, R. Taft, and M. L. Brodie. Enterprise Database Applications and the Cloud: A Difficult Road Ahead. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E '14, pages 1--6, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. M. E. Thomadakis. The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms. Resource, 3: 2, 2011.Google ScholarGoogle Scholar
  65. P. Trancoso and J. Torrellas. The Impact of Speeding Up Critical Sections with Data Prefetching and Forwarding. In Proceedings of the 1996 International Conference on Parallel Processing, ICPP '96, pages 79--86.Google ScholarGoogle Scholar
  66. R. K. Treiber. Systems Programming: Coping with Parallelism. Technical Report RJ5118, IBM Almaden Research Center, 2006.Google ScholarGoogle Scholar
  67. L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33 (8): 103--111, Aug. 1990.Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. E. Vallejo, R. Beivide, A. Cristal, T. Harris, F. Vallejo, O. Unsal, and M. Valero. Architectural Support for Fair Reader-Writer Locking. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 275--286, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. K. B. Wheeler, R. C. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the 22nd International Parallel and Distributed Processing Symposium, IPDPS '08, pages 1--8, 2008.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 51, Issue 4
        ASPLOS '16
        April 2016
        774 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2954679
        • Editor:
        • Andy Gill
        Issue’s Table of Contents
        • cover image ACM Conferences
          ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
          March 2016
          824 pages
          ISBN:9781450340915
          DOI:10.1145/2872362
          • General Chair:
          • Tom Conte,
          • Program Chair:
          • Yuanyuan Zhou

        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 25 March 2016

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!