Revisiting the combining synchronization technique

Published: 25 February 2012

Abstract

Fine-grain thread synchronization has been shown, in several cases, to be outperformed by efficient implementations of the combining technique, in which a single thread, called the combiner, holds a coarse-grain lock and serves, in addition to its own synchronization request, the active requests announced by other threads while they wait by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, and in many cases they exhibit much better performance than the most efficient finely synchronized algorithms.

In this paper, we revisit the combining technique with the goal of discovering where its real performance power resides and whether, or how, ensuring some desired properties (e.g., fairness in serving requests) would impact performance. We do so by presenting two new implementations of this technique; the first (CC-Synch) addresses systems that support coherent caches, whereas the second (DSM-Synch) works better in cacheless NUMA machines. In comparison to previous such implementations, the new implementations (1) provide bounds on the number of remote memory references (RMRs) that they perform, (2) support a stronger notion of fairness, and (3) use simpler and less basic primitives than previous approaches. In all our experiments, the new implementations outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms. Our experimental analysis sheds light on the questions we aimed to answer.

Several modern multi-core systems organize the cores into clusters and provide fast communication within the same cluster and much slower communication across clusters. We present a hierarchical version of CC-Synch, called H-Synch, which exploits the hierarchical communication nature of such systems to achieve better performance. Experiments show that H-Synch significantly outperforms previous state-of-the-art hierarchical approaches.

We provide new implementations of common shared data structures (like stacks and queues) based on CC-Synch, DSM-Synch, and H-Synch. Our experiments show that these implementations outperform by far all previous (fine-grain or combining-based) implementations of shared stacks and queues.

