Abstract
As the level of parallelism in manycore processors keeps increasing, providing efficient mechanisms for thread synchronization in concurrent programs is becoming a major concern. On cache-coherent shared-memory processors, synchronization efficiency is ultimately limited by the performance of the underlying cache coherence protocol. This paper studies how hardware support for message passing can improve synchronization performance. Considering the ubiquitous problem of mutual exclusion, we adapt two state-of-the-art solutions used on shared-memory processors, namely the server approach and the combining approach, to leverage the potential of hardware message passing. We propose HybComb, a novel combining algorithm that uses both message passing and shared memory features of emerging hybrid processors. We also introduce MP-Server, a straightforward adaptation of the server approach to hardware message passing. Evaluation on Tilera's TILE-Gx processor shows that MP-Server can execute contended critical sections with unprecedented throughput, as stalls related to cache coherence are removed from the critical path. HybComb can achieve comparable performance, while avoiding the need to dedicate server cores. Consequently, our queue and stack implementations, based on MP-Server and HybComb, largely outperform their most efficient pure-shared-memory counterparts.
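The server approach mentioned above can be illustrated with a minimal sketch. It is not the MP-Server implementation: software queues (`queue.Queue`) stand in for hardware message queues, and the names `server_loop`, `client`, and `run`, as well as the `None` shutdown sentinel, are hypothetical choices made for this example. The point it shows is that a single server thread owns the shared data and executes every critical section itself, so clients never take a lock.

```python
import queue
import threading


def server_loop(requests):
    """Dedicated server: owns the shared counter and runs every critical section."""
    counter = 0  # touched only by the server thread, so no lock is required
    while True:
        msg = requests.get()          # receive the next request message
        if msg is None:               # shutdown sentinel (assumption of this sketch)
            return
        delta, reply = msg
        counter += delta              # the critical section
        reply.put(counter)            # send the result back to the requesting client


def client(requests, n_ops, results):
    """Client: delegates each critical section to the server and waits for the reply."""
    reply = queue.Queue(maxsize=1)    # per-client reply channel
    last = 0
    for _ in range(n_ops):
        requests.put((1, reply))      # request: "increment by 1"
        last = reply.get()            # block until the server answers
    results.append(last)


def run(n_clients=4, n_ops=1000):
    """Start one server and n_clients clients; return the final counter value."""
    requests = queue.Queue()
    results = []
    server = threading.Thread(target=server_loop, args=(requests,))
    server.start()
    clients = [
        threading.Thread(target=client, args=(requests, n_ops, results))
        for _ in range(n_clients)
    ]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    requests.put(None)                # stop the server once all clients are done
    server.join()
    # The request served last returns the final counter value to its client.
    return max(results)
```

Calling `run(4, 1000)` returns the total number of increments performed. A combining algorithm such as HybComb differs in that no thread is permanently dedicated to the server role; this sketch does not attempt to model that.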
Leveraging hardware message passing for efficient thread synchronization
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming