skip to main content
research-article

Enhancing locality for recursive traversals of recursive structures

Published:22 October 2011Publication History
Skip Abstract Section

Abstract

While there has been decades of work on developing automatic, locality-enhancing transformations for regular programs that operate over dense matrices and arrays, there has been little investigation of such transformations for irregular programs, which operate over pointer-based data structures such as graphs, trees and lists. In this paper, we argue that, for a class of irregular applications we call traversal codes, there exists substantial data reuse and hence opportunity for locality exploitation. We develop a novel optimization called point blocking, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in traversal codes. We then present a transformation and optimization framework called TreeTiler that automatically detects opportunities for applying point blocking and applies the transformation. TreeTiler uses autotuning techniques to determine appropriate parameters for the transformation. For a series of traversal algorithms drawn from real-world applications, we show that TreeTiler is able to deliver performance improvements of up to 245% over an optimized (but non-transformed) parallel baseline, and in several cases, significantly better scalability.

References

  1. S. Aluru, J. Gustafson, G. M. Prabhu, and F. E. Sevilgen. Distribution-independent hierarchical algorithms for the n-body problem. J. Supercomput., 12:303--323, October 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Amor, F. Argüello, J. López, O. G. Plata, and E. L. Zapata. A data parallel formulation of the barnes-hut method for n -body simulations. In Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia, PARA'00, pages 342--349, London, UK, 2001. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. Barnes and P. Hut. A hierarchical o(n log n) force-calculation algorithm. Nature, 324(4):446--449, December 1986.Google ScholarGoogle ScholarCross RefCross Ref
  4. T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache-conscious structure definition. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, PLDI '99, pages 13--24, New York, NY, USA, 1999. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-conscious structure layout. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, PLDI '99, pages 1--12, New York, NY, USA, 1999. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. T. M. Chilimbi and J. R. Larus. Using generational garbage collection to implement cache-conscious data placement. In Proceedings of the 1st international symposium on Memory management, ISMM '98, pages 37--48, New York, NY, USA, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. T. Ekman and G. Hedin. The jastadd extensible java compiler. In Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications, OOPSLA '07, pages 1--18, New York, NY, USA, 2007. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. C. Fox. A graphical approach to load balancing and sparse matrix vector multiplication on the hypercube. Institute for Mathematics and Its Applications, 13:37--, 1988.Google ScholarGoogle Scholar
  9. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the cilk-5 multithreaded language. SIGPLAN Not., 33(5):212--223, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. Ghiya, L. Hendren, and Y. Zhu. Detecting parallelism in c programs with recursive data structures. IEEE Transactions on Parallel and Distributed Systems, 1:35--47, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Ghiya and L. J. Hendren. Is it a tree, a dag, or a cyclic graph? a shape analysis for heap-directed pointers in c. In POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 1--15, New York, NY, USA, 1996. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. G. Gray and A. W. Moore. $N$-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems (NIPS) 13 (Dec 2000). MIT Press, 2001.Google ScholarGoogle Scholar
  13. T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji. 42 tflops hierarchical n-body simulations on gpus with applications in both astrophysics and turbulence. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--12, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H. Han and C.-W. Tseng. Exploiting locality for irregular scientific codes. IEEE Trans. Parallel Distrib. Syst., 17:606--618, July 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. L. Hernquist. Vectorization of tree traversals. J. Comput. Phys., 87:137--147, March 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. K. Kennedy and J. Allen, editors. Optimizing compilers for modren architectures:a dependence-based approach. Morgan Kaufmann, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Kulkarni, M. Burtscher, K. Pingali, and C. Cascaval. Lonestar: A suite of parallel irregular programs. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 65--76, April 2009.Google ScholarGoogle ScholarCross RefCross Ref
  18. M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI 2007), 42(6):211--222, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. H. L. Nyland and J. Prins. Fast n-body simulation with cuda. GPU Gems, (3):677--695, 2007.Google ScholarGoogle Scholar
  20. C. Lattner and V. Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, PLDI '05, pages 129--142, New York, NY, USA, 2005. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Makino. Vectorization of a treecode. J. Comput. Phys., 87:148--160, March 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. E. Mansson, J. Munkberg, and T. Akenine-Moller. Deep coherent ray tracing. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, pages 79--85, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Mellor-Crummey, D. Whalley, and K. Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. Int. J. Parallel Program., 29(3):217--247, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Pharr, C. Kolb, R. Gershbein, and P. Hanrahan. Rendering complex scenes with memory-coherent ray tracing. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, SIGGRAPH '97, pages 101--108, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. K. Pingali, M. Kulkarni, D. Nguyen, M. Burtscher, M. Mendez-Lojo, D. Prountzos, X. Sui, and Z. Zhong. Amorphous data-parallelism in irregular algorithms. Technical Report TR-09-05, Department of Computer Science, The University of Texas at Austin, February 2009.Google ScholarGoogle Scholar
  26. M. Rinard and P. C. Diniz. Commutativity analysis: a new analysis technique for parallelizing compilers. ACM Trans. Program. Lang. Syst., 19(6):942--991, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Sagiv, T. Reps, and R. Wilhelm. Parametric shape analysis via 3-valued logic. ACM Transactions on Programming Languages and Systems, 24(3), May 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. K. Salmon. Parallel hierarchical N-body methods. PhD thesis, Pasadena, CA, USA, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J. P. Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy. Load balancing and data locality in adaptive hierarchical n-body methods: Barnes-hut, fast multipole, and radiosity. J. Parallel Distrib. Comput., 27(2):118--141, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Thees and C. Weiland. Implementing lightcuts. Technical report, Fachhochschule Bonn-Rhein-Sieg, University of Applied Sciences, Fachbereich Informatik, July 2008.Google ScholarGoogle Scholar
  31. A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing, pages 1--12, Washington, DC, USA, 2009. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. D. N. Truong, F. Bodin, and A. Seznec. Improving cache behavior of dynamically allocated data structures. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, PACT '98, pages 322--, Washington, DC, USA, 1998. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. R. Vuduc, J. W. Demmel, and K. A. Yelick. Oski: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1), 2005.Google ScholarGoogle Scholar
  34. I. Wald. On fast construction of sah-based bounding volume hierarchies. In RT '07: Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, pages 33--40, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. B. Walter, K. Bala, M. Kulkarni, and K. Pingali. Fast agglomerative clustering for rendering. In IEEE Symposium on Interactive Ray Tracing (RT), pages 81--86, August 2008.Google ScholarGoogle ScholarCross RefCross Ref
  36. B. Walter, S. Fernandez, A. Arbree, K. Bala, M. Donikian, and D. Greenberg. Lightcuts: a scalable approach to illumination. ACM Transactions on Graphics (SIGGRAPH), 24(3):1098--1107, July 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the atlas project. Parallel Computing, 27:2001, 2000.Google ScholarGoogle Scholar

Index Terms

  1. Enhancing locality for recursive traversals of recursive structures

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 46, Issue 10
      OOPSLA '11
      October 2011
      1063 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/2076021
      Issue’s Table of Contents
      • cover image ACM Conferences
        OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
        October 2011
        1104 pages
        ISBN:9781450309400
        DOI:10.1145/2048066

      Copyright © 2011 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 October 2011

      Check for updates

      Qualifiers

      • research-article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader
    About Cookies On This Site

    We use cookies to ensure that we give you the best experience on our website.

    Learn more

    Got it!