Abstract
Hybrid MPI+Threads programming has emerged as an alternative to the “MPI everywhere” model, better suited to the increasing core density of cluster nodes. While the MPI standard permits multithreaded concurrent communication, this flexibility comes at the cost of maintaining thread safety inside the MPI implementation, typically via critical sections. In contrast to previous work that studied the granularity of critical sections in MPI implementations, in this paper we investigate the implications of critical-section arbitration for communication performance. We first analyze the MPI runtime when multithreaded concurrent communication takes place on hierarchical memory systems. Our results indicate that the mutex-based approach most MPI implementations use today can incur performance penalties due to unfair arbitration. We then present methods to mitigate these penalties: a first-come, first-served arbitration scheme and a priority locking scheme that favors threads doing useful work. Evaluations with several benchmarks and applications demonstrate up to a 5-fold performance improvement.
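The first-come, first-served arbitration described in the abstract can be illustrated with a classic ticket lock: a pthread mutex makes no ordering guarantee under contention, so one thread can repeatedly re-acquire the lock and starve the others, whereas a ticket lock serves waiters strictly in arrival order. The following C11 sketch is illustrative only, not the paper's implementation; all names are hypothetical.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

/* Sketch of first-come, first-served (FCFS) lock arbitration using a
 * ticket lock. Each arriving thread atomically takes a ticket; threads
 * enter the critical section in strict ticket (arrival) order.
 * Illustrative sketch only -- NOT the implementation from the paper. */
typedef struct {
    atomic_uint next_ticket; /* ticket handed to the next arriving thread */
    atomic_uint now_serving; /* ticket number currently admitted */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

static void ticket_lock_acquire(ticket_lock_t *l) {
    /* Take a ticket; arrival order fixes the service order. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1u);
    while (atomic_load(&l->now_serving) != my_ticket)
        ; /* spin until it is this thread's turn */
}

static void ticket_lock_release(ticket_lock_t *l) {
    /* Admit the next waiter in FIFO order. */
    atomic_fetch_add(&l->now_serving, 1u);
}
```

A priority scheme in the spirit of the abstract could be layered on top, for instance by letting a thread with useful work pending skip ahead of pure-progress waiters; the sketch above only captures the fairness half of the idea.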
MPI+Threads: runtime contention and remedies
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming