Abstract
While there have been decades of work on developing automatic, locality-enhancing transformations for regular programs that operate over dense matrices and arrays, there has been little investigation of such transformations for irregular programs, which operate over pointer-based data structures such as graphs, trees and lists. In this paper, we argue that, for a class of irregular applications we call traversal codes, there exists substantial data reuse and hence opportunity for locality exploitation. We develop a novel optimization called point blocking, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in traversal codes. We then present a transformation and optimization framework called TreeTiler that automatically detects opportunities for applying point blocking and applies the transformation. TreeTiler uses autotuning techniques to determine appropriate parameters for the transformation. For a series of traversal algorithms drawn from real-world applications, we show that TreeTiler is able to deliver performance improvements of up to 245% over an optimized (but non-transformed) parallel baseline and, in several cases, significantly better scalability.
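The abstract does not spell out the mechanics of point blocking, so the following is only a minimal sketch of one way to read the idea by analogy with loop tiling. Instead of each point traversing the tree on its own, a block of points descends together, and at each node only the points that still need to visit it are carried further. Everything here is an illustrative assumption rather than TreeTiler's implementation: the 1-D interval tree, the range-count query, and the names `Node`, `build`, `count_unblocked`, `count_blocked`, and `block_size` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Binary tree node covering the closed interval [lo, hi] (hypothetical)."""
    lo: float
    hi: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    values: List[float] = field(default_factory=list)  # filled at leaves only


def build(values: List[float], leaf_size: int = 2) -> Node:
    """Build a balanced tree over a non-empty list of values."""
    values = sorted(values)
    node = Node(values[0], values[-1])
    if len(values) <= leaf_size:
        node.values = values
        return node
    mid = len(values) // 2
    node.left = build(values[:mid], leaf_size)
    node.right = build(values[mid:], leaf_size)
    return node


def misses(node: Node, q: float, r: float) -> bool:
    """True when [q - r, q + r] cannot overlap the node's interval."""
    return q + r < node.lo or q - r > node.hi


def count_one(node: Node, q: float, r: float) -> int:
    """Baseline: one point walks the tree alone."""
    if misses(node, q, r):
        return 0
    if node.left is None:  # leaf
        return sum(1 for v in node.values if abs(v - q) <= r)
    return count_one(node.left, q, r) + count_one(node.right, q, r)


def count_unblocked(root: Node, queries: List[float], r: float) -> List[int]:
    return [count_one(root, q, r) for q in queries]


def count_block(node: Node, block: List[int], queries: List[float],
                r: float, out: List[int]) -> None:
    """Blocked: every point in `block` visits `node` before anyone descends."""
    # Keep only the points whose traversal is not truncated at this node.
    live = [i for i in block if not misses(node, queries[i], r)]
    if not live:
        return
    if node.left is None:  # leaf: the node's data is reused across the block
        for i in live:
            out[i] += sum(1 for v in node.values if abs(v - queries[i]) <= r)
        return
    count_block(node.left, live, queries, r, out)
    count_block(node.right, live, queries, r, out)


def count_blocked(root: Node, queries: List[float], r: float,
                  block_size: int) -> List[int]:
    """Split the points into blocks; block_size is an assumed tunable."""
    out = [0] * len(queries)
    for start in range(0, len(queries), block_size):
        block = list(range(start, min(start + block_size, len(queries))))
        count_block(root, block, queries, r, out)
    return out
```

In this sketch the baseline touches a tree node once per point, whereas the blocked version touches it once per block and reuses it for every live point in that block. That reuse is the temporal-locality effect tiling aims for. The `block_size` argument stands in for the kind of parameter an autotuner might choose.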
Enhancing locality for recursive traversals of recursive structures
OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications






