Abstract
On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. First, in the shared-memory approach, thieves (nodes without work) repeatedly attempt to asynchronously steal work from randomly chosen victims until they find some. In distributed memory, a thief cannot autonomously steal work from a victim without disrupting the victim's execution. When work is sparse, this disruption degrades performance; in essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Moreover, thieves waste CPU cycles attacking victims that have no work, causing system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect distributed termination actively, i.e., to detect that the programs at all nodes are looking for work and hence that no work remains. This problem is well studied and requires careful design for good performance. Unfortunately, in most existing languages and frameworks, application developers are forced to implement their own distributed termination detection.
In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steal attempts, it quiesces after informing the nodes on its outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and in turn shares this work along its incoming lifeline edges, reactivating quiesced nodes. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination can be detected automatically using the finish construct -- no application code is necessary.
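The hypercube construction above can be sketched concretely. The following is an illustration in Python rather than the paper's X10 code, and it assumes the simplest case of a binary k-cube over a power-of-two node count; the function name is hypothetical:

```python
# Illustrative sketch (not the paper's implementation): the outgoing
# lifeline edges of node i in a k-dimensional binary hypercube over
# P = 2**k nodes are obtained by flipping each of the k bits of i.
# Degree and diameter are both k = log2(P), so the resulting directed
# graph is low-degree, low-diameter, and fully connected.
def lifeline_neighbors(i, P):
    assert P > 1 and P & (P - 1) == 0, "sketch assumes a power-of-two node count"
    k = P.bit_length() - 1          # dimension of the hypercube
    return [i ^ (1 << d) for d in range(k)]

# Node 5 (binary 101) in an 8-node 3-cube:
print(lifeline_neighbors(5, 8))     # [4, 7, 1]
```

Because any two nodes in a binary k-cube are at most k hops apart, work arriving at any active node can propagate along lifelines to every quiesced node in at most log2(P) steps.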
Our design is implemented in a few hundred lines of X10. On the binomial tree described by Olivier et al. (LCPC '06), the program achieves 87% efficiency on an InfiniBand cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/s. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, with a peak throughput of 0.966 GNodes/s. All numbers are relative to single-core sequential performance. This implementation has been refactored into a reusable global load-balancing framework. Applications can use this framework to obtain global load balance with minimal code changes.
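Since the efficiencies above are stated relative to single-core sequential performance, a quick back-of-the-envelope check using only the numbers quoted above recovers the implied per-core sequential rate:

```python
# Parallel efficiency = peak_throughput / (cores * sequential_rate),
# so the implied single-core rate on the Power7 cluster is:
peak = 2.37e9        # nodes/s at peak on 1024 Power7 cores
cores = 1024
efficiency = 0.87
sequential_rate = peak / (cores * efficiency)
print(f"{sequential_rate / 1e6:.2f} MNodes/s")   # ~2.66 MNodes/s per core
```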
In summary, we claim: (a) the first formulation of UTS that does not involve application-level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals, (c) the demonstration of simple lifeline graphs based on k-dimensional hypercubes, and (d) performance with superior efficiency to (or the same efficiency over a wider range than) published results on UTS. In particular, our framework delivers the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.
- J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. Atlas: an infrastructure for global computing. In EW 7: Proceedings of the 7th Workshop on ACM SIGOPS European Workshop, pages 165--172, New York, NY, USA, 1996. ACM.
- R. Batoukov and T. Sorevik. A Generic Parallel Branch and Bound Environment on a Network of Workstations. In HiPer '99: Proceedings of High Performance Computing on Hewlett-Packard Systems, pages 474--483, 1999.
- P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32(5):1260--1279, 2003.
- S. M. Blackburn, R. L. Hudson, R. Morrison, J. E. B. Moss, D. S. Munro, and J. Zigman. Starting with termination: a methodology for building distributed garbage collection algorithms. Aust. Comput. Sci. Commun., 23(1):20--28, 2001.
- R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356--368, 1994.
- R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In ATEC '97: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 10--10, Berkeley, CA, USA, 1997. USENIX Association.
- G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. In ICPP '08: Proceedings of the 37th International Conference on Parallel Processing, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, pages 137--150. USENIX Association, 2004.
- E. Dijkstra and C. Scholten. Termination detection for diffusing computations. Information Processing Letters, 11(1):1--4, 1980.
- E. W. Dijkstra. Derivation of a termination detection algorithm for distributed computations. pages 507--512, 1987.
- J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--11, New York, NY, USA, 2009. ACM.
- J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic Load Balancing of Unbalanced Computations Using Message Passing. In IPDPS '07: IEEE International Parallel and Distributed Processing Symposium, pages 1--8, Long Beach, CA, March 2007.
- T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC Language Specification, 1.1 edition, May 2003. http://www.gwu.edu/~upc/downloads/upc specs 1.1p2pre1.pdf.
- N. Francez and M. Rodeh. Achieving distributed termination without freezing. IEEE Trans. Softw. Eng., 8(3):287--292, 1982.
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212--223, Montreal, Quebec, Canada, June 1998.
- A. Grama and V. Kumar. State of the Art in Parallel Search Techniques for Discrete Optimization Problems. IEEE Trans. on Knowl. and Data Eng., 11(1):28--35, 1999.
- Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
- L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proceedings of Object-Oriented Programming Systems, Languages and Applications, ACM SIGPLAN Notices, volume 28, pages 91--108, 1993.
- P. Kambadur, A. Gupta, A. Ghoting, H. Avron, and A. Lumsdaine. PFunc: Modern task parallelism for modern high performance computing. In Proceedings of the 2009 ACM/IEEE Conference on Supercomputing (SC), Portland, Oregon, November 2009.
- V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994.
- Message Passing Interface Forum. MPI, June 1995. http://www.mpi-forum.org/.
- Message Passing Interface Forum. MPI-2, July 1997. http://www.mpi-forum.org/.
- E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Trans. Parallel Distrib. Syst., 2(3):264--280, 1991.
- S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In LCPC '06: Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag.
- S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In ICPP '08: Proceedings of the 37th International Conference on Parallel Processing, pages 123--131, Washington, DC, USA, 2008. IEEE Computer Society.
- OpenMP Architecture Review Board. OpenMP Application Program Interface, v3.0, May 2008.
- J. Prins, J. Huan, B. Pugh, C.-W. Tseng, and P. Sadayappan. UPC Implementation of an Unbalanced Tree Search Benchmark. Technical Report 03-034, University of North Carolina at Chapel Hill, October 2003.
- J. Reinders. Intel Threading Building Blocks. O'Reilly, 2007.
- V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, D. Cunningham, D. Grove, S. Kodali, I. Peshansky, and O. Tardieu. The asynchronous partitioned global address space model. In AMP '10: Proceedings of the First Workshop on Advances in Message Passing, June 2010.
- A. B. Sinha and L. V. Kale. A load balancing strategy for prioritized execution of tasks. In IPPS '93: Proceedings of the International Parallel Processing Symposium, pages 230--237, 1993.
- R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In PPoPP '01: Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 34--43, New York, NY, USA, 2001. ACM.
Lifeline-based global load balancing. In PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming.