Research Article · Public Access

Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

Published: 26 January 2017

Abstract

Modern hardware contains parallel execution resources that are well suited for data parallelism (vector units) and for task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that dynamically select between executing task blocks on vector units or on multicores. We show that these schedulers are asymptotically optimal and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×–108× speedups over sequential baselines.
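To make the abstract's central idea concrete, the following is a minimal sketch (not the paper's implementation) of a task-block scheduler: a block holds a group of pending tasks, and the scheduler either runs a full block "vector-style" (one operation over all lanes at once, standing in for a vector unit) or splits an under-full block into independent sub-blocks (standing in for handing work to separate cores). All names here (`TaskBlock`, `run_block`, `SIMD_WIDTH`, the `fib` workload) are illustrative assumptions, not identifiers from the paper.

```python
SIMD_WIDTH = 4  # pretend hardware vector width

def fib(n):
    # The recursive work each task performs (stand-in workload).
    return n if n < 2 else fib(n - 1) + fib(n - 2)

class TaskBlock:
    """A group of pending tasks, one argument per 'lane'."""
    def __init__(self, tasks):
        self.tasks = tasks

def run_block(block):
    if len(block.tasks) >= SIMD_WIDTH:
        # Full block: execute all lanes together (the vector-unit path).
        return sum(fib(n) for n in block.tasks)
    if len(block.tasks) == 1:
        return fib(block.tasks[0])
    # Under-full block: split into independent sub-blocks (the multicore
    # path; a real scheduler would hand these to work-stealing workers).
    mid = len(block.tasks) // 2
    return (run_block(TaskBlock(block.tasks[:mid]))
            + run_block(TaskBlock(block.tasks[mid:])))

print(run_block(TaskBlock([10, 11, 12, 5])))  # → 293
```

The dynamic choice in `run_block` mirrors the schedulers described in the abstract: when enough uniform work has accumulated to fill the vector lanes, the block executes as a batch; otherwise it decomposes into independently executable pieces.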



Published in

ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages.
ISSN: 0362-1340 · EISSN: 1558-1160 · DOI: 10.1145/3155284

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages.
ISBN: 9781450344937 · DOI: 10.1145/3018743

Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

