ABSTRACT
Applications often involve the iterative execution of identical or slowly evolving calculations. Such applications require incremental rebalancing to improve load balance across iterations. In this paper, we consider the design and evaluation of two distinct approaches to this challenge: persistence-based load balancing and work stealing. The work to be performed is overdecomposed into tasks, enabling automatic rebalancing by the middleware. We present a hierarchical persistence-based rebalancing algorithm that performs localized incremental rebalancing, and an active-message-based retentive work stealing algorithm optimized for iterative applications on distributed-memory machines. We demonstrate low overheads and high efficiencies on the full NERSC Hopper (146,400 cores) and ALCF Intrepid (163,840 cores) systems, and on up to 128,000 cores of OLCF Titan.
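The persistence principle, that a task's measured cost in one iteration predicts its cost in the next, can be illustrated with a minimal greedy rebalancer. This is only a sketch, not the hierarchical algorithm described in the paper; the function name and greedy longest-task-first strategy are illustrative assumptions:

```python
import heapq

def persistence_rebalance(task_times, n_procs):
    """Assign each task to the currently least-loaded processor, using
    the task's measured duration from the previous iteration as its
    predicted cost. Returns one list of task indices per processor."""
    # Placing longer tasks first improves the balance of greedy assignment.
    order = sorted(range(len(task_times)), key=lambda i: -task_times[i])
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_procs)]
    for i in order:
        load, p = heapq.heappop(heap)   # least-loaded processor so far
        assignment[p].append(i)
        heapq.heappush(heap, (load + task_times[i], p))
    return assignment

# Example: six tasks with previous-iteration durations, two processors.
mapping = persistence_rebalance([5, 3, 3, 2, 2, 1], 2)
```

Unlike this global greedy pass, the paper's hierarchical scheme performs localized incremental moves, so most tasks stay on their current processor and migration cost stays low.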
Index Terms
Work stealing and persistence-based load balancers for iterative overdecomposed applications