Abstract
Task parallelism has become increasingly popular, with programming models such as OpenMP 3.0, Cilk, Java Concurrency, X10, Chapel, and Habanero-Java (HJ) addressing the requirements of multicore programmers. While task parallelism increases productivity by allowing the programmer to express multiple levels of parallelism, it can also degrade performance due to increased task creation and termination overheads. In this article, we introduce a transformation framework for optimizing task-parallel programs, with a focus on task creation and task termination operations. These operations can appear explicitly, in constructs such as async and finish in X10 and HJ, task and taskwait in OpenMP 3.0, and spawn and sync in Cilk, or implicitly, in composite statements such as foreach and ateach loops in X10, forall and foreach loops in HJ, and parallel loops in OpenMP.
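As an illustration, explicit task creation and termination in HJ/X10-style syntax look as follows. This is a hedged sketch: the recursive `fib` example and variable names are our own, not taken from the article.

```
// HJ/X10-style sketch: async spawns a child task; the enclosing
// finish blocks until every task spawned in its scope terminates.
int fib(int n) {
  if (n < 2) return n;
  final int[] r = new int[2];
  finish {                      // task-termination scope
    async r[0] = fib(n - 1);    // task creation (child task)
    r[1] = fib(n - 2);          // parent continues concurrently
  }                             // implicit join of all child tasks
  return r[0] + r[1];
}
```

The same pattern is written spawn/sync in Cilk and task/taskwait in OpenMP 3.0; each spawn/join pair carries runtime overhead, which is what the transformations below target.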
Our framework includes a definition of data dependence in task-parallel programs, a happens-before analysis algorithm, and a range of program transformations for optimizing task parallelism. Broadly, our transformations cover three different but interrelated optimizations: (1) finish-elimination, (2) forall-coarsening, and (3) loop-chunking. Finish-elimination removes redundant task termination operations, forall-coarsening replaces expensive task creation and termination operations with more efficient synchronization operations, and loop-chunking extracts useful parallelism from ideal parallelism. All three optimizations are specified in an iterative transformation framework that applies a sequence of relevant transformations until a fixed point is reached. Further, we discuss the impact of exception semantics on the specified transformations, and extend them to handle task-parallel programs with precise exception semantics. Experimental results were obtained for a collection of task-parallel benchmarks on three multicore platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-core Intel Xeon SMP, and a quad-socket 32-core Power7 SMP. We have observed that the proposed optimizations interact synergistically, resulting in an overall geometric-mean performance improvement between 6.28× and 10.30×, measured across all three platforms for the benchmarks studied.
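The flavor of these transformations can be sketched in HJ-style syntax. These are illustrative rewrites under the assumption that the loop iterations are independent (which the happens-before and dependence analyses would have to establish); they are not the article's formal rewrite rules, and the names `a`, `f`, `step`, and `numWorkers` are our own.

```
// Finish-elimination (sketch): the inner finish is redundant because
// the outer finish already awaits every async spawned in its scope,
// and (by assumption) no data dependence requires iteration i's task
// to complete before iteration i+1 starts.
finish {
  for (int i = 0; i < n; i++)
    finish async a[i] = f(i);   // before: per-iteration join
}
// becomes
finish {
  for (int i = 0; i < n; i++)
    async a[i] = f(i);          // after: n fine-grained tasks, one join
}

// Loop-chunking (sketch): extract useful parallelism (one task per
// worker) from the ideal n-way parallelism above.
finish {
  for (int p = 0; p < numWorkers; p++)
    async for (int i = p; i < n; i += numWorkers)
      a[i] = f(i);
}

// Forall-coarsening (sketch): hoist task creation out of a repeated
// parallel loop, replacing T rounds of create/terminate with a
// barrier (next) inside a single forall.
for (int t = 0; t < T; t++)
  forall (point [i] : [0:n-1]) step(t, i);
// becomes
forall (point [i] : [0:n-1])
  for (int t = 0; t < T; t++) {
    step(t, i);
    next;                       // barrier replaces per-round join
  }
```

The cyclic chunking shown for loop-chunking is only one possible iteration-distribution policy; block or guided distributions fit the same pattern.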
A Transformation Framework for Optimizing Task-Parallel Programs