
Leveraging hardware message passing for efficient thread synchronization

Published: 06 February 2014

Abstract

As the level of parallelism in manycore processors keeps increasing, providing efficient mechanisms for thread synchronization in concurrent programs is becoming a major concern. On cache-coherent shared-memory processors, synchronization efficiency is ultimately limited by the performance of the underlying cache coherence protocol. This paper studies how hardware support for message passing can improve synchronization performance. Considering the ubiquitous problem of mutual exclusion, we adapt two state-of-the-art solutions used on shared-memory processors, namely the server approach and the combining approach, to leverage the potential of hardware message passing. We propose HybComb, a novel combining algorithm that uses both message passing and shared memory features of emerging hybrid processors. We also introduce MP-Server, a straightforward adaptation of the server approach to hardware message passing. Evaluation on Tilera's TILE-Gx processor shows that MP-Server can execute contended critical sections with unprecedented throughput, as stalls related to cache coherence are removed from the critical path. HybComb can achieve comparable performance, while avoiding the need to dedicate server cores. Consequently, our queue and stack implementations, based on MP-Server and HybComb, largely outperform their most efficient pure-shared-memory counterparts.
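The server approach described above replaces lock-based mutual exclusion with delegation: client threads send critical-section requests over a message channel to a dedicated server core, which executes them serially on state that never migrates between caches. The following minimal sketch illustrates that idea in Python for a concurrent stack. It is an illustration only, not the paper's MP-Server implementation: `queue.Queue` stands in for the TILE-Gx hardware message-passing channels, threads stand in for cores, and all names (`MPServerStack`, `_serve`) are invented for this example.

```python
import queue
import threading

class MPServerStack:
    """Concurrent stack whose operations are delegated to one server thread.

    Sketch of the server (delegation) approach: clients never touch the
    shared state directly, so no lock is needed and, on real hardware,
    the state would stay resident in the server core's cache.
    """

    def __init__(self):
        self._requests = queue.Queue()   # stand-in for a hardware message channel
        self._items = []                 # server-private state
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        # Server loop: executes every critical section itself, serially.
        while True:
            op, arg, reply = self._requests.get()
            if op == "push":
                self._items.append(arg)
                reply.put(None)          # acknowledge completion
            elif op == "pop":
                reply.put(self._items.pop() if self._items else None)

    def push(self, value):
        reply = queue.Queue(maxsize=1)   # per-request reply channel
        self._requests.put(("push", value, reply))
        reply.get()                      # block until the server has executed it

    def pop(self):
        reply = queue.Queue(maxsize=1)
        self._requests.put(("pop", None, reply))
        return reply.get()
```

A client simply calls `push`/`pop`; each call is one request message and one reply message, with no shared-memory synchronization on the client side. The cost the paper's evaluation highlights is that this dedicates a core to serving requests, which is exactly what the HybComb combining scheme avoids.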

