research-article

Barrier elision for production parallel programs

Published: 24 January 2015

Abstract

Large scientific code bases are often composed of several layers of runtime libraries implemented in multiple programming languages. In such situations, programmers often choose conservative synchronization patterns, leading to suboptimal performance. In this paper, we present context-sensitive dynamic optimizations that elide barriers that are redundant during program execution. In our technique, we perform data race detection alongside the program to identify redundant barriers in their calling contexts; after an initial learning phase, we start eliding all future instances of barriers occurring in the same calling context. We present an automatic on-the-fly optimization and a multi-pass guided optimization. We apply our techniques to NWChem, a 6-million-line computational chemistry code written in C/C++/Fortran that uses several runtime libraries such as Global Arrays, ComEx, DMAPP, and MPI. Our technique elides a surprisingly high fraction of barriers (as many as 63%) in production runs. This redundancy elimination translates to application speedups as high as 14% on 2048 cores. Our techniques also provided valuable insight into the application's behavior, which NWChem developers later used. Overall, we demonstrate the value of holistic context-sensitive analyses that consider the domain science in conjunction with the associated runtime software stack.
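
The learn-then-elide idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation. It assumes the per-instance redundancy verdict arrives as a boolean (standing in for the data race detection the abstract mentions), that a calling context is a tuple of frame names, and that a fixed number of redundant observations ends the learning phase. The class `BarrierElider`, the parameter `learning_threshold`, and the frame names are all hypothetical. It also ignores the requirement that every participating process reach the same elision decision.

```python
class BarrierElider:
    """Toy, single-process model of context-sensitive barrier elision."""

    def __init__(self, learning_threshold=3):
        # Hypothetical knob: redundant observations needed before eliding.
        self.learning_threshold = learning_threshold
        self.redundant_count = {}   # calling context -> redundant observations
        self.never_elide = set()    # contexts where a barrier was needed once
        self.executed = 0
        self.elided = 0

    def should_elide(self, context):
        # A context is elidable only if it was never observed to be needed
        # and has been observed redundant often enough.
        if context in self.never_elide:
            return False
        return self.redundant_count.get(context, 0) >= self.learning_threshold

    def on_barrier(self, context, was_redundant):
        """Handle one barrier instance; return True if it was elided.

        was_redundant is an assumed input standing in for the race
        detector's verdict. It is consulted only while still learning.
        """
        if self.should_elide(context):
            self.elided += 1
            return True
        self.executed += 1
        if was_redundant:
            self.redundant_count[context] = self.redundant_count.get(context, 0) + 1
        else:
            self.never_elide.add(context)
        return False


# Two barriers reached through different (hypothetical) call paths.
elider = BarrierElider(learning_threshold=2)
ctx_a = ("main", "solver_step", "library_sync")
ctx_b = ("main", "checkpoint", "library_sync")

results_a = [elider.on_barrier(ctx_a, True) for _ in range(5)]
results_b = [elider.on_barrier(ctx_b, r) for r in (True, False, True, True)]
```

In this sketch the same `library_sync` call is elided when reached through `solver_step` once learning completes, but is always executed when reached through `checkpoint`, because that context was observed to need the barrier at least once. Keying the decision on the full calling context rather than on the barrier's call site is what makes the optimization context-sensitive.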



      • Published in

        ACM SIGPLAN Notices, Volume 50, Issue 8
        PPoPP '15
        August 2015
        290 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/2858788
        Editor: Andy Gill

        PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
        January 2015
        290 pages
        ISBN: 9781450332057
        DOI: 10.1145/2688500

        Copyright © 2015 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

