research-article
Public Access

Diagnosing the causes and severity of one-sided message contention

Published: 24 January 2015

Abstract

Two trends suggest network contention for one-sided messages is poised to become a performance problem that concerns application developers: an increased interest in one-sided programming models and a rising ratio of hardware threads to network injection bandwidth. Often it is difficult to reason about when one-sided tasks decrease or increase network contention. We present effective and portable techniques for diagnosing the causes and severity of one-sided message contention. To detect that a message is affected by contention, we maintain statistics representing instantaneous network resource demand. Using lightweight measurement and modeling, we identify the portion of a message's latency that is due to contention and whether contention occurs at the initiator or target. We attribute these metrics to program statements in their full static and dynamic context. We characterize contention for an important computational chemistry benchmark on InfiniBand, Cray Aries, and IBM Blue Gene/Q interconnects. We pinpoint the sources of contention, estimate their severity, and show that when message delivery time deviates from an ideal model, there are other messages contending for the same network links. With a small change to the benchmark, we reduce contention by 50% and improve total runtime by 20%.
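The abstract's central idea — measure each message against an ideal, contention-free delivery model and attribute the excess latency to contention — can be sketched in a few lines. This is an illustrative outline under assumed names and parameters (`NetworkModel`, `alpha`, `bandwidth`, and the numeric values are hypothetical), not the authors' actual implementation, which also classifies whether contention occurs at the initiator or target:

```python
from dataclasses import dataclass

@dataclass
class NetworkModel:
    """Ideal (contention-free) cost model for a one-sided message,
    in the spirit of LogGP-style models: T_ideal(s) = alpha + s / bandwidth."""
    alpha: float      # per-message latency (seconds), calibrated on an idle network
    bandwidth: float  # injection bandwidth (bytes/second)

    def ideal_time(self, size_bytes: int) -> float:
        return self.alpha + size_bytes / self.bandwidth

def contention_delay(model: NetworkModel, size_bytes: int, observed: float) -> float:
    """Portion of an observed message latency attributable to contention:
    any time beyond the ideal model's prediction, clamped at zero."""
    return max(0.0, observed - model.ideal_time(size_bytes))

# Hypothetical interconnect: 2 us message latency, 5 GB/s injection bandwidth.
model = NetworkModel(alpha=2e-6, bandwidth=5e9)

# A 1 MiB message observed at 300 us, versus an ideal of about 212 us:
delay = contention_delay(model, 1 << 20, 300e-6)
```

In the paper's workflow, a per-message statistic like `delay`, gathered with lightweight measurement, is attributed back to program statements in their full static and dynamic context; the clamp at zero reflects that deliveries faster than the model are not evidence of contention.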

