research-article

Lease/release: architectural support for scaling contended data structures

Published: 27 February 2016

Abstract

High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. There has been a significant amount of research effort spent investigating designs which minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms to allow scaling contended data structures at high thread counts.

In this paper, we investigate hardware support for scalable contended data structures. We propose Lease/Release, a simple addition to standard directory-based MSI cache coherence protocols, allowing participants to lease memory, at the granularity of cache lines, by delaying coherence messages for a short, bounded period of time. Our analysis shows that Lease/Release can significantly reduce the overheads of contention for both non-blocking (lock-free) and lock-based data structure implementations, while ensuring that no deadlocks are introduced. We validate Lease/Release empirically on the Graphite multiprocessor simulator, on a range of data structures, including queue, stack, and priority queue implementations, as well as on transactional applications. Results show that Lease/Release consistently improves both throughput and energy usage, by up to 5x, both for lock-free and lock-based data structure designs.
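As a rough illustration of the leasing idea described above, the following toy model shows one cache line whose owner holds a bounded lease while coherence requests from other cores are delayed until the lease is released or expires. It is a sketch under stated assumptions, not the protocol from the paper: the class `LeasedLine`, its methods, the `max_lease` bound, and the integer "tick" clock are all hypothetical names introduced here for illustration.

```python
from collections import deque


class LeasedLine:
    """Toy model of a single cache line that can be leased for a bounded time.

    All names here are illustrative assumptions; this is not the paper's protocol.
    """

    def __init__(self, max_lease):
        self.max_lease = max_lease  # assumed upper bound on lease length, in ticks
        self.owner = None           # core currently holding the line
        self.lease_end = 0          # tick at which the current lease expires
        self.pending = deque()      # requests delayed while the lease is active

    def lease(self, core, now, duration):
        """Give the line to `core` and lease it for at most `max_lease` ticks."""
        self.owner = core
        self.lease_end = now + min(duration, self.max_lease)

    def request(self, core, now):
        """Another core asks for the line. Returns True if granted immediately."""
        if self.owner is not None and now < self.lease_end:
            self.pending.append(core)  # delay the request until release or expiry
            return False
        self.owner = core
        return True

    def release(self, now):
        """End the lease (voluntarily or on expiry) and serve the oldest waiter."""
        self.lease_end = now
        self.owner = self.pending.popleft() if self.pending else None
        return self.owner


line = LeasedLine(max_lease=10)
line.lease("A", now=0, duration=100)    # requested length is clamped to the bound
granted = line.request("B", now=3)      # delayed: the lease is still active
next_owner = line.release(now=5)        # the owner releases early; B is served
```

Clamping the requested duration to `max_lease` models the bounded delay mentioned in the abstract: in this sketch a waiting request is never held longer than the bound, which is the property that keeps a lease holder from blocking others indefinitely.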



  • Published in

    ACM SIGPLAN Notices, Volume 51, Issue 8
    PPoPP '16
    August 2016
    405 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3016078

    • PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      February 2016
      420 pages
      ISBN: 9781450340922
      DOI: 10.1145/2851141

    Copyright © 2016 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

