research-article · DOI: 10.1145/1504176.1504209

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Published: 14 February 2009

ABSTRACT

Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a completely automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. In our approach, we employ a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time, and dynamic scheduling of the parallel tiles on the processor cores for improved scalability. Our approach obviates the need for programmer intervention and rewriting of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.
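The general run-time mechanism the abstract describes, executing tiles as their inter-tile dependences become satisfied, can be sketched as a dependence-count scheduler. This is an illustrative sketch, not the paper's compiler-generated code: `run_tiles`, its arguments, and the worker structure are hypothetical names, and real generated code would compute the dependence sets from the polyhedral representation rather than take them as input.

```python
import threading
import queue

def run_tiles(tiles, deps, num_workers=4):
    """Execute tiles in dependence order using a shared ready queue.

    tiles: iterable of tile identifiers.
    deps:  dict mapping a tile to the set of tiles it depends on.
    Returns the order in which tiles completed (for inspection).
    """
    tiles = list(tiles)
    # Build successor lists and a counter of unsatisfied predecessors per tile.
    succs = {t: [] for t in tiles}
    count = {t: 0 for t in tiles}
    for t, preds in deps.items():
        for p in preds:
            succs[p].append(t)
            count[t] += 1

    ready = queue.Queue()
    for t in tiles:
        if count[t] == 0:
            ready.put(t)          # tiles with no predecessors start ready

    lock = threading.Lock()
    order = []                    # completion order, guarded by lock
    remaining = [len(tiles)]
    done = threading.Event()

    def worker():
        while not done.is_set():
            try:
                t = ready.get(timeout=0.1)
            except queue.Empty:
                continue
            # ... execute tile t here (e.g., a blocked factorization step) ...
            with lock:
                order.append(t)
                remaining[0] -= 1
                if remaining[0] == 0:
                    done.set()
                # Release successors whose last predecessor just finished.
                for s in succs[t]:
                    count[s] -= 1
                    if count[s] == 0:
                        ready.put(s)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return order
```

Because a tile enters the ready queue only after all of its predecessors have completed, idle cores can pick up whichever tiles become ready first, which is what avoids the load imbalance of statically scheduled wavefronts for multi-statement codes like Cholesky or LU.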


Published in

PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2009, 322 pages
ISBN: 9781605583976
DOI: 10.1145/1504176

Also published in ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09), April 2009, 294 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/1594835

Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 230 of 1,014 submissions, 23%
