
A Transformation Framework for Optimizing Task-Parallel Programs

Published: 01 April 2013

Abstract

Task parallelism has become increasingly prominent in programming models such as OpenMP 3.0, Cilk, Java Concurrency, X10, Chapel, and Habanero-Java (HJ), which address the needs of multicore programmers. While task parallelism improves productivity by allowing the programmer to express multiple levels of parallelism, it can also degrade performance due to increased overheads. In this article, we introduce a transformation framework for optimizing task-parallel programs, with a focus on task creation and task termination operations. These operations can appear explicitly in constructs such as async and finish in X10 and HJ, task and taskwait in OpenMP 3.0, and spawn and sync in Cilk, or implicitly in composite code statements such as foreach and ateach loops in X10, forall and foreach loops in HJ, and parallel loops in OpenMP.
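The async/finish idiom mentioned above can be sketched at the library level in plain Java (HJ itself is Java-based). This is a minimal, hypothetical illustration, not the paper's implementation: an executor's shutdown-and-await stands in for the end of a finish scope, which waits for all tasks spawned inside it.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncFinishSketch {
    public static void main(String[] args) throws Exception {
        AtomicInteger sum = new AtomicInteger();
        ForkJoinPool pool = new ForkJoinPool();
        // finish {                       -- start of finish scope
        for (int i = 1; i <= 4; i++) {
            final int v = i;
            pool.execute(() -> sum.addAndGet(v)); // async: spawn a child task
        }
        // }   -- end of finish scope: wait for all spawned tasks to finish
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(sum.get()); // 1 + 2 + 3 + 4 = 10
    }
}
```

Each task-creation (`execute`) and the termination wait (`awaitTermination`) carries runtime overhead, which is precisely what the transformations in this article aim to reduce.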

Our framework includes a definition of data dependence in task-parallel programs, a happens-before analysis algorithm, and a range of program transformations for optimizing task parallelism. Broadly, our transformations cover three different but interrelated optimizations: (1) finish-elimination, (2) forall-coarsening, and (3) loop-chunking. Finish-elimination removes redundant task termination operations, forall-coarsening replaces expensive task creation and termination operations with more efficient synchronization operations, and loop-chunking extracts useful parallelism from ideal parallelism. All three optimizations are specified in an iterative transformation framework that applies a sequence of relevant transformations until a fixed point is reached. Further, we discuss the impact of exception semantics on the specified transformations, and extend them to handle task-parallel programs with precise exception semantics. Experimental results were obtained for a collection of task-parallel benchmarks on three multicore platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-core Intel Xeon SMP, and a quad-socket 32-core Power7 SMP. We have observed that the proposed optimizations interact with each other in a synergistic way, and result in an overall geometric average performance improvement between 6.28× and 10.30×, measured across all three platforms for the benchmarks studied.
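The loop-chunking idea, extracting useful parallelism from ideal parallelism, can be illustrated with a hand-written Java sketch (hypothetical, not the paper's compiler transformation): rather than spawning one task per iteration, the iteration space is divided into one contiguous chunk per worker, so task creation and termination costs are paid once per chunk instead of once per iteration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LoopChunkingSketch {
    public static void main(String[] args) throws Exception {
        int n = 1_000_000;
        int numChunks = Runtime.getRuntime().availableProcessors();
        LongAdder sum = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(numChunks);
        // Chunked loop: one task per chunk, not one task per iteration.
        int chunk = (n + numChunks - 1) / numChunks;
        for (int c = 0; c < numChunks; c++) {
            final int lo = c * chunk;
            final int hi = Math.min(n, lo + chunk);
            pool.submit(() -> {
                long local = 0;
                for (int i = lo; i < hi; i++) local += i; // sequential chunk body
                sum.add(local);
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(sum.sum()); // sum of 0..n-1 = n*(n-1)/2 = 499999500000
    }
}
```

The paper's contribution is performing this kind of rewriting automatically and safely, using its happens-before analysis to ensure that chunking does not reorder dependent operations.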



• Published in

  ACM Transactions on Programming Languages and Systems, Volume 35, Issue 1 (April 2013), 240 pages.
  ISSN: 0164-0925, EISSN: 1558-4593
  DOI: 10.1145/2450136

  Copyright © 2013 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 December 2011
  • Revised: 1 October 2012
  • Accepted: 1 November 2012
  • Published: 1 April 2013


        Qualifiers

        • research-article
        • Research
        • Refereed
