research-article

Lifeline-based global load balancing

Published: 12 February 2011

Abstract

On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. First, in the shared-memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves waste CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (that is, to detect that the programs at all nodes are looking for work, and hence that there is no work left). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages and frameworks, application developers are forced to implement their own distributed termination detection.

In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and it in turn shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination can be detected automatically using the finish construct; no application code is necessary.
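The lifeline mechanism described above can be illustrated with a small, hedged Python sketch. It is not the paper's X10 implementation. It assumes a binary k-dimensional hypercube (each node's outgoing lifelines are the nodes whose labels differ in one bit), a round-based toy model in which an active node consumes one unit of work per round, and a "give half" sharing rule. The names `hypercube_lifelines`, `diameter`, `simulate`, and `waiting` are hypothetical and chosen only for this illustration.

```python
import random
from collections import deque


def hypercube_lifelines(k):
    """Outgoing lifelines for each of the 2**k nodes of a binary hypercube.

    Assumption: node i points to every node whose label differs from i in
    exactly one bit, so the out-degree is k.
    """
    return {i: [i ^ (1 << d) for d in range(k)] for i in range(2 ** k)}


def diameter(graph):
    """Longest shortest path, found by BFS from every node.

    Raises ValueError if some node cannot reach all the others.
    """
    worst = 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(graph):
            raise ValueError("graph is not strongly connected")
        worst = max(worst, max(dist.values()))
    return worst


def simulate(k=4, initial_work=1000, w=2, seed=0):
    """Toy model of lifeline-based balancing; returns (processed, failed_steals)."""
    rng = random.Random(seed)
    lifelines = hypercube_lifelines(k)
    n = len(lifelines)
    work = [0] * n
    work[0] = initial_work              # all work starts at node 0
    quiescent = [False] * n
    waiting = {i: set() for i in range(n)}  # waiting[v]: nodes that named v as a lifeline
    processed = failed_steals = 0

    def share(victim, thief):
        """Move half of the victim's work to the thief; return the amount moved."""
        amount = work[victim] // 2
        work[victim] -= amount
        work[thief] += amount
        return amount

    # Termination: every node has quiesced (and therefore holds no work).
    while not all(quiescent):
        for i in range(n):
            if quiescent[i]:
                continue                # quiescent nodes disturb nobody
            if work[i] > 0:
                work[i] -= 1            # process one unit of work
                processed += 1
                for t in sorted(waiting[i]):   # serve recorded lifeline requests
                    if share(i, t) > 0:
                        waiting[i].discard(t)
                        quiescent[t] = False   # work arrival reactivates the node
                continue
            for _ in range(w):          # out of work: up to w random steal attempts
                victim = (i + rng.randrange(1, n)) % n   # any node except i
                if share(victim, i) > 0:
                    break
                failed_steals += 1
            else:                       # all w steals failed: inform lifelines, quiesce
                for v in lifelines[i]:
                    waiting[v].add(i)
                quiescent[i] = True
    return processed, failed_steals
```

For example, `hypercube_lifelines(3)` gives node 0 the lifelines `[1, 2, 4]`, and `diameter` of that graph is 3, which matches the low-degree, low-diameter property. In this toy model a node only quiesces when it holds no work, so `simulate` returns once every unit of the initial work has been processed.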

Our design is implemented in a few hundred lines of X10. On the binomial tree described in [olivier:08], the program achieves 87% efficiency on an InfiniBand cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/s. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, with a peak throughput of 0.966 GNodes/s. All numbers are relative to single-core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes.
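For readers unfamiliar with the metric, the following is the conventional definition of parallel efficiency relative to a sequential baseline. The abstract does not spell out its formula, so the symbols here are assumptions introduced only for illustration: P is the number of cores, T_1 the single-core sequential time, T_P the time on P cores, N the number of tree nodes visited, and R = N/T the throughput.

```latex
% Assumed (standard) definition; not quoted from the paper.
\[
  E(P) \;=\; \frac{T_1}{P \, T_P}
       \;=\; \frac{N / T_P}{P \cdot (N / T_1)}
       \;=\; \frac{R_P}{P \, R_1}
\]
```

Under this definition, 87% efficiency means the throughput on P cores is 0.87 times P times the single-core sequential throughput. Whether the reported peak throughput and the reported efficiency come from the same run is not stated in the abstract.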

In summary, we claim: (a) the first formulation of UTS that does not involve application-level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals, (c) the demonstration of simple lifeline graphs based on k-hypercubes, and (d) efficiency superior to published results on UTS (or the same efficiency over a wider range). In particular, our framework can deliver performance equal to or better than that of an unrestricted random work-stealing implementation, while reducing the number of attempted steals.

References

  1. J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. Atlas: an infrastructure for global computing. In EW 7: Proceedings of the 7th workshop on ACM SIGOPS European workshop, pages 165--172, New York, NY, USA, 1996. ACM.
  2. R. Batoukov and T. Sorevik. A Generic Parallel Branch and Bound Environment on a Network of Workstations. In HiPer '99: Proceedings of High Performance Computing on Hewlett-Packard Systems, pages 474--483, 1999.
  3. P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32(5):1260--1279, 2003.
  4. S. M. Blackburn, R. L. Hudson, R. Morrison, J. E. B. Moss, D. S. Munro, and J. Zigman. Starting with termination: a methodology for building distributed garbage collection algorithms. Aust. Comput. Sci. Commun., 23(1):20--28, 2001.
  5. R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356--368, 1994.
  6. R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In ATEC '97: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 10--10, Berkeley, CA, USA, 1997. USENIX Association.
  7. G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. In ICPP '08: Proceedings of the 2008 37th International Conference on Parallel Processing, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society.
  8. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design and Implementation, pages 137--150. USENIX Association, 2004.
  9. E. Dijkstra and C. Scholten. Termination detection for diffusing computations. In Information Processing Letters, volume 11, pages 1--4, 1980.
  10. E. W. Dijkstra. Derivation of a termination detection algorithm for distributed computations. pages 507--512, 1987.
  11. J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--11, New York, NY, USA, 2009. ACM.
  12. J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic Load Balancing of Unbalanced Computations Using Message Passing. In IPDPS 07: Parallel and Distributed Processing Symposium, pages 1--8, Long Beach, CA, March 2007. IEEE International.
  13. T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC Language Specification, 1.1 edition, May 2003. http://www.gwu.edu/~upc/downloads/upc_specs_1.1p2pre1.pdf.
  14. N. Francez and M. Rodeh. Achieving distributed termination without freezing. IEEE Trans. Softw. Eng., 8(3):287--292, 1982.
  15. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212--223, Montreal, Quebec, Canada, June 1998. Proceedings published in ACM SIGPLAN Notices, Vol. 33, No. 5, May, 1998.
  16. A. Grama and V. Kumar. State of the Art in Parallel Search Techniques for Discrete Optimization Problems. IEEE Trans. on Knowl. and Data Eng., 11(1):28--35, 1999.
  17. Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
  18. L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proceedings of Object Oriented Programming Systems, Languages and Applications, ACM Sigplan Notes, volume 28, pages 91--108, 1993.
  19. P. Kambadur, A. Gupta, A. Ghoting, H. Avron, and A. Lumsdaine. PFunc: Modern task parallelism for modern high performance computing. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing (SC), Portland, Oregon, November 2009.
  20. V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994.
  21. Message Passing Interface Forum. MPI, June 1995. http://www.mpi-forum.org/.
  22. Message Passing Interface Forum. MPI-2, July 1997. http://www.mpi-forum.org/.
  23. E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Trans. Parallel Distrib. Syst., 2(3):264--280, 1991.
  24. S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In LCPC'06: Proceedings of the 19th international conference on Languages and compilers for parallel computing, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag.
  25. S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: an unbalanced tree search benchmark. In Proceedings of the 19th international conference on Languages and compilers for parallel computing, LCPC'06, pages 235--250, Berlin, Heidelberg, 2007. Springer-Verlag.
  26. S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In ICPP '08: Proceedings of the 2008 37th International Conference on Parallel Processing, pages 123--131, Washington, DC, USA, 2008. IEEE Computer Society.
  27. OpenMP Architecture Review Board. OpenMP Application Program Interface, v3.0. May 2008.
  28. J. Prins, J. Huan, B. Pugh, C.-W. Tseng, and P. Sadayappan. UPC Implementation of an Unbalanced Tree Search Benchmark. Technical Report 03-034, University of North Carolina at Chapel Hill, October 2003.
  29. J. Reinders. Intel Threading Building Blocks. O'Reilly, 2007.
  30. V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, D. Cunningham, D. Grove, S. Kodali, I. Peshansky, and O. Tardieu. The asynchronous partitioned global address space model. In AMP'10: Proceedings of The First Workshop on Advances in Message Passing, June 2010.
  31. A. B. Sinha and L. V. Kale. A load balancing strategy for prioritized execution of tasks. In IIPS'93: Proceedings of International Parallel Processing Symposium, pages 230--237, 1993.
  32. R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In PPoPP '01: Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming, pages 34--43, New York, NY, USA, 2001. ACM.


Published in

ACM SIGPLAN Notices, Volume 46, Issue 8 (PPoPP '11), August 2011, 300 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2038037

PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, February 2011, 326 pages
ISBN: 9781450301190
DOI: 10.1145/1941553
General Chair: Calin Cascaval
Program Chair: Pen-Chung Yew

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
