research-article

Automatically enhancing locality for tree traversals with traversal splicing

Published: 19 October 2012

Abstract

Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely tree traversal algorithms such as Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However, point blocking is highly dependent on point sorting, a technique that reorders points so that consecutive points have similar traversals. Performing this a priori sort requires an understanding of the semantics of the algorithm and hence relies on highly application-specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes that is less sensitive to point order and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing can deliver single-thread speedups of up to 9.147× (geometric mean: 3.095×) over baseline implementations, and up to 4.752× (geometric mean: 2.079×) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields performance better than that of carefully hand-optimized implementations.
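To make the loop-tiling analogy concrete, the following is a minimal sketch of point blocking as the abstract describes it at a high level. It is not the paper's implementation, and it does not implement traversal splicing. The one-dimensional interval tree, the `reaches` truncation test, and the names `baseline` and `point_blocked` are all assumptions introduced for illustration. The baseline walks the tree once per point; the blocked version walks it once per block of points, so each node is touched by several points before the traversal moves on.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    lo: float
    hi: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(lo: float, hi: float, depth: int) -> Node:
    """Build a complete binary tree over the interval [lo, hi]."""
    node = Node(lo, hi)
    if depth > 0:
        mid = (lo + hi) / 2
        node.left = build(lo, mid, depth - 1)
        node.right = build(mid, hi, depth - 1)
    return node


def reaches(p: float, node: Node, radius: float) -> bool:
    """Hypothetical truncation test: p descends only into nearby nodes."""
    return node.lo - radius <= p <= node.hi + radius


def baseline(root: Node, points: List[float], radius: float) -> Dict[float, int]:
    """One full traversal per point: the tree is re-walked for every point."""
    visited = {p: 0 for p in points}

    def walk(node: Optional[Node], p: float) -> None:
        if node is None or not reaches(p, node, radius):
            return
        visited[p] += 1  # stand-in for the per-node work
        walk(node.left, p)
        walk(node.right, p)

    for p in points:
        walk(root, p)
    return visited


def point_blocked(root: Node, points: List[float], radius: float,
                  block_size: int) -> Dict[float, int]:
    """One traversal per block: all live points in the block visit a node together."""
    visited = {p: 0 for p in points}

    def walk(node: Optional[Node], block: List[float]) -> None:
        if node is None:
            return
        # Only points whose own traversal would reach this node stay live.
        live = [p for p in block if reaches(p, node, radius)]
        if not live:
            return
        for p in live:
            visited[p] += 1
        walk(node.left, live)
        walk(node.right, live)

    for i in range(0, len(points), block_size):
        walk(root, points[i:i + block_size])
    return visited
```

Both functions perform the same per-point work, only in a different order. In this sketch, a block whose points sit close together keeps most of its points live down the same subtrees, while a block of scattered points thins out quickly, which is one way to read the abstract's remark that point blocking depends on point sorting.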


Published in

ACM SIGPLAN Notices, Volume 47, Issue 10
OOPSLA '12
October 2012
1011 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2398857

OOPSLA '12: Proceedings of the ACM international conference on Object oriented programming systems languages and applications
October 2012
1052 pages
ISBN: 9781450315616
DOI: 10.1145/2384616

Copyright © 2012 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
