DOI: 10.1145/1693453.1693479

Lazy binary-splitting: a run-time adaptive work-stealing scheduler

Published: 09 January 2010

ABSTRACT

We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on the existing Eager Binary Splitting work-stealing (EBS) implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease of programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting threshold (sst) for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to a particular dataset, platform, and calling context of the do-all loop, resulting in poor performance portability. LBS overcomes both the performance-portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions, without requiring tuning.

We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which also requires no manual tuning but lacks context portability; LBS outperforms it by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst on a training dataset and then running both on a different execution dataset: LBS outperforms SP by 19.5% on average, while allowing for improved performance portability without tedious manual tuning. LBS also outperforms SP with sst=1, its default value when undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealer, by 54.7%. Finally, compared to serializing inner parallelism (SI), which has been used by OpenMP, LBS is 54.2% faster.


Published in:
PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010, 372 pages. ISBN: 9781605588773. DOI: 10.1145/1693453.
Also published in: ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10), May 2010, 346 pages. ISSN: 0362-1340, EISSN: 1558-1160. DOI: 10.1145/1837853.

Copyright © 2010 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.

Qualifiers: research-article
Overall acceptance rate: 230 of 1,014 submissions, 23%
