Abstract
Two trends suggest network contention for one-sided messages is poised to become a performance problem that concerns application developers: an increased interest in one-sided programming models and a rising ratio of hardware threads to network injection bandwidth. Often it is difficult to reason about when one-sided tasks decrease or increase network contention. We present effective and portable techniques for diagnosing the causes and severity of one-sided message contention. To detect that a message is affected by contention, we maintain statistics representing instantaneous network resource demand. Using lightweight measurement and modeling, we identify the portion of a message's latency that is due to contention and whether contention occurs at the initiator or target. We attribute these metrics to program statements in their full static and dynamic context. We characterize contention for an important computational chemistry benchmark on InfiniBand, Cray Aries, and IBM Blue Gene/Q interconnects. We pinpoint the sources of contention, estimate their severity, and show that when message delivery time deviates from an ideal model, there are other messages contending for the same network links. With a small change to the benchmark, we reduce contention by 50% and improve total runtime by 20%.
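To make the measurement-and-modeling idea concrete, below is a minimal C sketch of the two ingredients the abstract describes: a counter of outstanding one-sided messages as a proxy for instantaneous network demand, and attribution of the part of a message's latency that exceeds an ideal, uncontended cost model to contention. This is not the paper's implementation; the constants, the function names, and the simple alpha-plus-size-over-bandwidth cost model are illustrative assumptions.

```c
/* Hedged sketch, not the authors' tool: track instantaneous demand with a
 * counter of in-flight one-sided messages, and charge latency in excess of
 * an ideal (uncontended) model to contention when other messages overlap. */
#include <stddef.h>
#include <stdio.h>

static long   outstanding_msgs   = 0;   /* instantaneous demand at this rank */
static double contention_seconds = 0;   /* latency attributed to contention  */

/* Assumed per-network constants: fixed overhead (s) and bandwidth (bytes/s). */
static const double ALPHA   = 1.5e-6;
static const double BETA_BW = 6.0e9;

/* Ideal delivery time for an uncontended message of `bytes` bytes. */
static double ideal_latency(size_t bytes) {
    return ALPHA + (double)bytes / BETA_BW;
}

/* Called when a one-sided put/get is initiated. */
static void record_initiation(void) { outstanding_msgs++; }

/* Called when the message completes; `measured` is its observed latency (s). */
static void record_completion(size_t bytes, double measured) {
    outstanding_msgs--;
    double excess = measured - ideal_latency(bytes);
    if (excess > 0 && outstanding_msgs > 0)   /* other messages were in flight */
        contention_seconds += excess;         /* charge the excess to contention */
}

int main(void) {
    record_initiation();
    record_initiation();                 /* two overlapping messages */
    record_completion(1 << 20, 2.0e-3);  /* 1 MiB observed to take 2 ms */
    record_completion(1 << 20, 0.3e-3);
    printf("time attributed to contention: %g s\n", contention_seconds);
    return 0;
}
```

In a full tool these counters would be maintained per initiator and per target and the contention time attributed to program statements in their calling context, as the abstract describes; the sketch only shows the accounting step.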