Research article · Public Access

MPI+Threads: runtime contention and remedies

Published: 24 January 2015

Abstract

Hybrid MPI+Threads programming has emerged as an alternative to the “MPI everywhere” model, better suited to the increasing core density of cluster nodes. While the MPI standard allows multithreaded concurrent communication, this flexibility comes at the cost of maintaining thread safety within the MPI implementation, typically by means of critical sections. In contrast to previous work that studied the importance of critical-section granularity in MPI implementations, in this paper we investigate the implications of critical-section arbitration for communication performance. We first analyze the MPI runtime when multithreaded concurrent communication takes place on hierarchical memory systems. Our results indicate that the mutex-based approach most MPI implementations use today can incur performance penalties due to unfair arbitration. We then present methods that mitigate these penalties with first-come, first-served arbitration and a priority locking scheme that favors threads doing useful work. Through evaluations using several benchmarks and applications, we demonstrate up to 5-fold improvement in performance.



Published in

ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
ISSN: 0362-1340 · EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill

Also appears in: PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500

Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States