Abstract
Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely tree traversal algorithms such as Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However, point blocking is highly dependent on point sorting, a technique that reorders points so that consecutive points have similar traversals. Performing this a priori sort requires an understanding of the semantics of the algorithm, and hence requires highly application-specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes that is less sensitive to point order and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing can deliver single-thread speedups of up to 9.147 (geometric mean: 3.095) over baseline implementations, and up to 4.752 (geometric mean: 2.079) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields performance that is better than that of carefully hand-optimized implementations.
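To make the loop-tiling analogy concrete, the following is a minimal sketch of point blocking, not the implementation evaluated in this work. It assumes a hypothetical 1-D range-count query over a binary tree; the names `Node`, `build`, `baseline`, and `blocked`, and the `block_size` parameter, are illustrative assumptions. The baseline runs one full traversal per point, whereas the blocked version carries a block of points down the tree together, so each node is touched once per block rather than once per point.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    lo: float                      # smallest value stored in this subtree
    hi: float                      # largest value stored in this subtree
    values: List[float]            # non-empty only at leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(values: List[float], leaf_size: int = 2) -> Node:
    """Build a balanced binary tree over a non-empty list of values."""
    values = sorted(values)
    if len(values) <= leaf_size:
        return Node(values[0], values[-1], values)
    mid = len(values) // 2
    return Node(values[0], values[-1], [],
                build(values[:mid], leaf_size), build(values[mid:], leaf_size))


def pruned(node: Node, p: float, r: float) -> bool:
    """True if the query interval [p - r, p + r] misses this subtree."""
    return p + r < node.lo or p - r > node.hi


def count_one(node: Optional[Node], p: float, r: float) -> int:
    """Baseline: one complete traversal for a single point."""
    if node is None or pruned(node, p, r):
        return 0
    if node.left is None:  # leaf
        return sum(1 for v in node.values if abs(v - p) <= r)
    return count_one(node.left, p, r) + count_one(node.right, p, r)


def baseline(root: Node, points: List[float], r: float) -> List[int]:
    return [count_one(root, p, r) for p in points]


def blocked(root: Node, points: List[float], r: float,
            block_size: int = 4) -> List[int]:
    """Point blocking: move a block of points through the tree together."""
    counts = [0] * len(points)

    def visit(node: Optional[Node], block: List[int]) -> None:
        if node is None:
            return
        # Only points whose traversal is not truncated here continue downward.
        survivors = [i for i in block if not pruned(node, points[i], r)]
        if not survivors:
            return
        if node.left is None:  # leaf
            for i in survivors:
                counts[i] += sum(1 for v in node.values
                                 if abs(v - points[i]) <= r)
            return
        visit(node.left, survivors)
        visit(node.right, survivors)

    for start in range(0, len(points), block_size):
        visit(root, list(range(start, min(start + block_size, len(points)))))
    return counts
```

In this sketch, a block whose points follow very different paths loses members quickly as it descends, so little reuse remains at each node; this is one way to see why point blocking benefits from consecutive points having similar traversals.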
Automatically enhancing locality for tree traversals with traversal splicing
OOPSLA '12: Proceedings of the ACM international conference on Object oriented programming systems languages and applications