Abstract
Many applications feature a mix of irregular and regular computational structures. For example, codes using adaptive mesh refinement (AMR) typically use a collection of regular blocks, where the number of blocks and the relationships between blocks are irregular. The computational structure in such applications generally involves regular (affine) loop computations within some number of innermost loops, while outer loops exhibit irregularity due to data-dependent control flow and indirect array access patterns. Prior approaches to distributed memory parallelization do not handle such computations effectively. They either target loop nests that are completely affine using polyhedral frameworks, or treat all loops as irregular. Consequently, the generated distributed memory code contains artifacts that disrupt the regular nature of previously affine innermost loops of the computation. This hampers subsequent optimizations to improve on-node performance. We propose a code generation framework that can effectively transform such applications for execution on distributed memory systems. Our approach generates distributed memory code which preserves program properties that enable subsequent polyhedral optimizations. Simultaneously, it addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code. The effectiveness of the proposed framework is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.
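The mixed structure described above can be illustrated with a minimal sketch (all names here are hypothetical, chosen for illustration and not taken from the paper): an irregular outer loop that visits AMR-style blocks through an indirection array, wrapping regular inner loops whose bounds and subscripts are affine and therefore amenable to polyhedral optimization.

```python
BSIZE = 4  # each block is a dense BSIZE x BSIZE array

def smooth_block(block):
    """Regular part: affine inner loops over one dense block
    (a simple 4-point average stencil)."""
    out = [[0.0] * BSIZE for _ in range(BSIZE)]
    for i in range(1, BSIZE - 1):        # affine loop bounds
        for j in range(1, BSIZE - 1):    # affine loop bounds
            out[i][j] = 0.25 * (block[i - 1][j] + block[i + 1][j] +
                                block[i][j - 1] + block[i][j + 1])
    return out

def smooth_all(blocks, block_order):
    """Irregular part: data-dependent traversal of blocks via an
    indirection array, as in AMR block structures."""
    return {b: smooth_block(blocks[b]) for b in block_order}

# Three blocks initialized to 1.0; the block ordering is only known at runtime.
blocks = {b: [[1.0] * BSIZE for _ in range(BSIZE)] for b in range(3)}
result = smooth_all(blocks, block_order=[2, 0, 1])
```

Distributed memory code generators that treat the whole nest as irregular would insert indirection into the inner stencil loops as well, destroying the affine structure that later polyhedral passes rely on.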
Index Terms
Distributed memory code generation for mixed Irregular/Regular computations