
Distributed memory code generation for mixed Irregular/Regular computations

Published: 24 January 2015

Abstract

Many applications feature a mix of irregular and regular computational structures. For example, codes using adaptive mesh refinement (AMR) typically operate on a collection of regular blocks, where the number of blocks and the relationships between blocks are irregular. The computational structure in such applications generally involves regular (affine) computations within some number of innermost loops, while outer loops exhibit irregularity due to data-dependent control flow and indirect array access patterns. Prior approaches to distributed memory parallelization do not handle such computations effectively: they either target loop nests that are completely affine, using polyhedral frameworks, or treat all loops as irregular. Consequently, the generated distributed memory code contains artifacts that disrupt the regular nature of the previously affine innermost loops, hampering subsequent optimizations that improve on-node performance. We propose a code generation framework that effectively transforms such applications for execution on distributed memory systems. Our approach generates distributed memory code that preserves the program properties needed for subsequent polyhedral optimizations. Simultaneously, it addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code. The effectiveness of the proposed framework is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.



Published in

ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500

      Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

research-article
