Abstract
High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. Significant research effort has gone into designs that minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms that allow contended data structures to scale at high thread counts.
In this paper, we investigate hardware support for scalable contended data structures. We propose Lease/Release, a simple addition to standard directory-based MSI cache coherence protocols that allows participants to lease memory, at the granularity of cache lines, by delaying coherence messages for a short, bounded period of time. Our analysis shows that Lease/Release can significantly reduce the overheads of contention for both non-blocking (lock-free) and lock-based data structure implementations, while ensuring that no deadlocks are introduced. We validate Lease/Release empirically on the Graphite multiprocessor simulator, on a range of data structures, including queue, stack, and priority queue implementations, as well as on transactional applications. Results show that Lease/Release consistently improves both throughput and energy usage, by up to 5x, for both lock-free and lock-based data structure designs.
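To make the leasing idea concrete, the following is a minimal software sketch of a single cache line whose owner can hold a bounded lease. It is a toy model written for illustration, not the paper's protocol: the class `LeasedLine`, its methods, the `max_lease` parameter, and the abstract integer time units are all assumptions introduced here. It shows only the two properties named in the abstract: requests that arrive during a lease are delayed rather than dropped, and every lease is capped, so a delayed request is always served eventually.

```python
from collections import deque


class LeasedLine:
    """Toy model of one cache line as seen by the directory (illustrative only)."""

    def __init__(self, max_lease):
        self.max_lease = max_lease   # hypothetical hard bound on any lease length
        self.owner = None
        self.lease_end = 0           # time at which the current owner's lease lapses
        self.waiting = deque()       # delayed ownership requests, in FIFO order

    def _advance(self, now):
        # Once the lease has lapsed, delayed requesters get the line in order.
        # Unleased transfers are instantaneous in this toy model.
        while self.waiting and now >= self.lease_end:
            self.owner = self.waiting.popleft()
            self.lease_end = now     # the new owner starts without a lease

    def request(self, core, now):
        """Core asks for exclusive ownership; True if granted immediately."""
        self._advance(now)
        if self.owner == core:
            return True
        if self.owner is not None and now < self.lease_end:
            self.waiting.append(core)    # the request is delayed, never dropped
            return False
        self.owner = core
        self.lease_end = now
        return True

    def lease(self, core, now, duration):
        """Owner pins the line for at most max_lease time units."""
        self._advance(now)
        if self.owner != core:
            return False
        self.lease_end = now + min(duration, self.max_lease)
        return True

    def release(self, core, now):
        """Owner ends its lease early so that delayed requests can proceed."""
        if self.owner == core:
            self.lease_end = min(self.lease_end, now)
        self._advance(now)


line = LeasedLine(max_lease=10)
line.request("A", now=0)             # A becomes the owner
line.lease("A", now=0, duration=100) # the requested length is clamped to max_lease
line.request("B", now=3)             # B is queued while A's lease is active
line.release("A", now=5)             # A is done early; B now owns the line
```

In this sketch, the cap on lease length is what rules out indefinite waiting: even if the owner never calls `release`, the next request or lease operation after `lease_end` hands the line to the queued requester.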