Abstract
Modern hardware contains parallel execution resources that are well-suited for data parallelism (vector units) and task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×–108× speedups over sequential baselines.
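The task-block idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`TaskBlock`, `schedule`, `step`, `VECTOR_WIDTH`) are our own, and a simple Python loop stands in for actual SIMD execution. The point is the scheduling decision: a block of pending recursive calls is either executed together, as a vector unit would process a lane group, or split into independent sub-blocks that separate cores could execute.

```python
from dataclasses import dataclass

VECTOR_WIDTH = 4  # assumed SIMD width for this sketch

@dataclass
class TaskBlock:
    """A hypothetical task block: a group of pending recursive calls."""
    items: list

def schedule(block, step, results):
    """Process a task block, choosing between vector- and task-style execution.

    `step(item)` returns (result, children): the value computed for this
    item and any recursive sub-items it spawns.
    """
    if not block.items:
        return
    if len(block.items) <= VECTOR_WIDTH:
        # Small block: run all items together, standing in for one SIMD step.
        children = []
        for item in block.items:
            r, kids = step(item)
            results.append(r)
            children.extend(kids)
        schedule(TaskBlock(children), step, results)
    else:
        # Large block: split into halves that independent cores could steal.
        mid = len(block.items) // 2
        schedule(TaskBlock(block.items[:mid]), step, results)
        schedule(TaskBlock(block.items[mid:]), step, results)

# Usage: sum the values of a tree given as (value, [children]) tuples.
def step(node):
    value, kids = node
    return value, list(kids)

tree = (1, [(2, []), (3, [(4, []), (5, [])])])
out = []
schedule(TaskBlock([tree]), step, out)
print(sum(out))  # 15
```

In a real runtime the small-block branch would map items onto vector lanes and the large-block branch would spawn tasks for a work-stealing scheduler; here both are sequential so the control flow is easy to follow.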
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming