Abstract
Fine-grain thread synchronization has, in several cases, been shown to be outperformed by efficient implementations of the combining technique, in which a single thread, called the combiner, holds a coarse-grain lock and serves, in addition to its own synchronization request, the active requests announced by other threads while those threads wait by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, so in many cases they exhibit much better performance than the most efficient finely synchronized algorithms.
In this paper, we revisit the combining technique with the goal of discovering where its real performance power resides and whether or how ensuring some desired properties (e.g., fairness in serving requests) impacts performance. We do so by presenting two new implementations of this technique: the first (CC-Synch) addresses systems that support coherent caches, whereas the second (DSM-Synch) works better on cacheless NUMA machines. In comparison to previous such implementations, the new implementations (1) provide bounds on the number of remote memory references (RMRs) that they perform, (2) support a stronger notion of fairness, and (3) use simpler and less basic primitives than previous approaches. In all our experiments, the new implementations outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms. Our experimental analysis sheds light on the questions we aimed to answer.
Several modern multi-core systems organize their cores into clusters and provide fast communication within the same cluster but much slower communication across clusters. We present a hierarchical version of CC-Synch, called H-Synch, which exploits the hierarchical communication nature of such systems to achieve better performance. Experiments show that H-Synch significantly outperforms previous state-of-the-art hierarchical approaches.
We provide new implementations of common shared data structures (such as stacks and queues) based on CC-Synch, DSM-Synch, and H-Synch. Our experiments show that these implementations outperform by far all previous (fine-grain or combining-based) implementations of shared stacks and queues.
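To make the combining idea concrete, the following is a minimal sketch of an announce-array combining counter: each thread publishes its request in a per-thread slot, and whichever thread manages to take the coarse-grain lock becomes the combiner and applies every pending request, including its own. All names (CombiningCounter, announce, result) are illustrative; this is not the paper's CC-Synch, DSM-Synch, or H-Synch algorithm, which additionally bound RMRs and provide fairness guarantees.

```python
# Toy illustration of the combining technique (not CC-Synch/DSM-Synch).
import threading

class CombiningCounter:
    """Shared counter: one combiner serves all announced requests."""

    def __init__(self, nthreads):
        self.lock = threading.Lock()        # the coarse-grain lock
        self.value = 0
        # announce[i] holds thread i's pending increment, or None.
        self.announce = [None] * nthreads
        # result[i] is filled in by the combiner when request i is served.
        self.result = [None] * nthreads

    def add(self, tid, amount):
        self.announce[tid] = amount         # announce the request
        while self.result[tid] is None:     # spin until served
            if self.lock.acquire(blocking=False):
                try:
                    # This thread is now the combiner: serve every
                    # announced request, including (possibly) its own.
                    for i, req in enumerate(self.announce):
                        if req is not None:
                            self.value += req
                            self.announce[i] = None
                            self.result[i] = self.value
                finally:
                    self.lock.release()
        served, self.result[tid] = self.result[tid], None
        return served
```

Because every update to `value` happens while holding the lock, the counter stays consistent even though most threads never acquire the lock themselves; they merely spin on their own `result` slot, which is the source of the technique's reduced synchronization cost.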
Revisiting the combining synchronization technique. PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.