ABSTRACT
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
- N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119--129, 1998. Google Scholar
Digital Library
- G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. J. Harrison, S. Hirata, S. Krishanmoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. M. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE, 93(2):276--292, Feb. 2005.Google Scholar
Cross Ref
- P. Berenbrink, T. Friedetzky, and L. Goldberg. The natural work-stealing algorithm is stable. In Proc. 42nd IEEE Symposium on Foundations of Computer Science (FOCS), pages 178--187, 2001. Google Scholar
Digital Library
- G. E. Blelloch and J. Greiner. A provable time and space efflcient implementation of NESL. In Proc. 1st ACM SIGPLAN Intl. Conf. on Functional Programming (ICFP), pages 213--225, Philadelphia, Pennsylvania, May 1996. Google Scholar
Digital Library
- R. D. Blumofe and C. Leiserson. Scheduling multithreaded computations by work stealing. In Proc. 35th Symposium on Foundations of Computer Science (FOCS), pages 356--368, Nov. 1994. Google Scholar
Digital Library
- R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proc. USENIX Annual Technical Conference (ATEC), pages 10--10, 1997. Google Scholar
Digital Library
- Ü. V. Çatalyürek, E. G. Boman, K. D. Devine, D. Bozdag, R. Heaphy, and L. A. Riesen. Hypergraph-based dynamic load balancing for adaptive scientific computations. In Proc. 21st Intl. Parallel and Distributed Processing Symposium (IPDPS), pages 1--11. IEEE, 2007.Google Scholar
Cross Ref
- B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the chapel language. Int. J. High Performance Computing Applications (IJHPCA), 21(3):291--312, 2007. Google Scholar
Digital Library
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proc. Conf. on Object Oriented Prog. Systems, Languages, and Applications (OOPSLA), pages 519--538, 2005. Google Scholar
Digital Library
- G. Cong, S. Kodali, S. Krishnamoorty, D. Lea, V. Saraswat, and T. Wen. Solving irregular graph problems using adaptive work-stealing. In Proc. 37th Int Conf. on Parallel Processing (ICPP), Portland, OR, Sept. 2008. Google Scholar
Digital Library
- K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Teresco, J. Faik, J. E. Flaherty, and L. G. Gervasio. New challanges in dynamic load balancing. J. Appl. Numer. Math., 52(2--3):133--152, 2005. Google Scholar
Digital Library
- J. Dinan, S. Krishnamoorthy, D. B. Larkins, J. Nieplocha, and P. Sadayappan. Scioto: A framework for global-view task parallelism. In Proc. 37th Intl. Conf. on Parallel Processing (ICPP), pages 586--593, 2008. Google Scholar
Digital Library
- J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic load balancing of unbalanced computations using message passing. In Proc. of 6th Intl. Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS), pages 1--8, 2007.Google Scholar
Cross Ref
- J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. A message passing benchmark for unbalanced applications. J. Simulation Modelling Practice and Theory, 16(9):1177--1189, 2008.Google Scholar
Cross Ref
- N. Francez and M. Rodeh. Achieving distributed termination without freezing. IEEE Trans. on Software Engineering, SE-8(3):287--292, May 1982. Google Scholar
Digital Library
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proc. Conf. on Prog. Language Design and Implementation (PLDI), pages 212--223. ACM SIGPLAN, 1998. Google Scholar
Digital Library
- Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for terminally strict parallel programs. In Proc. 23rd Intl. Parallel and Distributed Processing Symposium (IPDPS), 2009.Google Scholar
- D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. J. Distributed Computing, 18(3):189--207, 2006. Google Scholar
Digital Library
- Intel Corporation. Cluster OpenMP user's guide v9.1. (309096-002 US), 2005--2006.Google Scholar
- L. V. Kalé and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proc. Conf. on Object Oriented Prog. Systems, Languages, and Applications (OOPSLA), pages 91--108, 1993. Google Scholar
Digital Library
- G. Karypis and V. Kumar. MeTis: Unstrctured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0, Sept. 1998.Google Scholar
- S. Krishnamoorthy, Ü. Çatalyürek, J. Nieplocha, A. Rountev, and P. Sadayappan. Hypergraph partitioning for automatic memory hierarchy management. In Proc. ACM/IEEE Conference Supercomputing (SC), page 98, 2006. Google Scholar
Digital Library
- V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994. Google Scholar
Digital Library
- Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406--471, 1999. Google Scholar
Digital Library
- L. Lamport. A new solution of dijkstra's concurrent programming problem. Commun. ACM, 17(8):453--455, 1974. Google Scholar
Digital Library
- M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 45--54, Feb. 2009. Google Scholar
Digital Library
- M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proc. 10th Symposium on Parallel Algorithms and Architectures (SPAA), pages 212--221. ACM, 1998. Google Scholar
Digital Library
- MPI Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, 1994. Google Scholar
Digital Library
- J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586:533--546, 1999. Google Scholar
Digital Library
- J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a portable "shared-memory" programming model for distributed memory computers. In Proc. ACM/IEEE Conference Supercomputing (SC), pages 340--349, 1994. Google Scholar
Digital Library
- S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An unbalanced tree search benchmark. In Proc. 19th Intl Workshop on Languages and Compilers for Parallel Computing (LCPC), pages 235--250, 2006. Google Scholar
Digital Library
- S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In Proc. of 37th Intl. Conference on Parallel Processing (ICPP), Sept. 2008. Google Scholar
Digital Library
- A. Sinha and L. V. Kalé. A load balancing strategy for prioritized execution of tasks. In Proc. 7th Intl. Parallel Processing Symposium (IPPS), pages 230--237, 1993.Google Scholar
Digital Library
- G. L. Steele Jr. Parallel programming and parallel abstractions in fortress. In Proc. 14th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), page 157, 2005. Google Scholar
Digital Library
- A. Trifunović and W. J. Knottenbelt. Parallel multilevel algorithms for hypergraph partitioning. J. Parallel Distrib. Comput., 68(5):563--581, 2008. Google Scholar
Digital Library
- UPC Consortium. UPC language specifications, v1.2. Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.Google Scholar
Cross Ref
- R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), 2001. Google Scholar
Digital Library
- T. Yanai, G. I. Fann, Z. Gan, R. J. Harrison, and G. Beylkin. Multiresolution quantum chemistry in multiwavelet bases: Analytic derivatives for hartree--fock and density functional theory. J. Chemical Physics, 121(7):2866--2876, 2004.Google Scholar
Cross Ref
- K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance java dialect. Concurrency: Practice and Experience, 10(11--13):825--836, 1998.Google Scholar
Index Terms
Scalable work stealing
Recommendations
Work stealing and persistence-based load balancers for iterative overdecomposed applications
HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed ComputingApplications often involve iterative execution of identical or slowly evolving calculations. Such applications require incremental rebalancing to improve load balance across iterations. In this paper, we consider the design and evaluation of two ...
Enabling a highly-scalable global address space model for petascale computing
CF '10: Proceedings of the 7th ACM international conference on Computing frontiersOver the past decade, the trajectory to the petascale has been built on increased complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of ...
Optimized distributed work-stealing
IA^3 '16: Proceedings of the Sixth Workshop on Irregular Applications: Architectures and AlgorithmsWork-stealing is a popular approach for dynamic load balancing of task-parallel programs. However, as has been widely studied, the use of classical work-stealing algorithms on massively parallel and distributed supercomputers introduces several ...




Comments