
Cache locality optimization for recursive programs

Published: 14 June 2017

Abstract

We present an approach to optimize the cache locality for recursive programs by dynamically splicing (recursively interleaving) the execution of distinct function invocations. By utilizing data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work in the context of a random work stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.
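To make the notion of reuse distance concrete, the following is a minimal sketch and not the paper's algorithm. It assumes two hypothetical, independent recursive traversals `f` and `g` over the same array (no effect annotations, dependence tracking, or lightweight threads are modeled), and compares the access trace of running them back to back with the trace of recursively interleaving them at the leaf blocks. All function names and the `leaf` parameter are illustrative assumptions.

```python
def trace_sequential(n, leaf):
    """Access trace when traversal f runs over [0, n), then g runs over [0, n)."""
    trace = []

    def visit(lo, hi):
        if hi - lo <= leaf:
            trace.extend(range(lo, hi))  # touch every element of the leaf block
            return
        mid = (lo + hi) // 2
        visit(lo, mid)
        visit(mid, hi)

    visit(0, n)  # hypothetical invocation f
    visit(0, n)  # hypothetical invocation g
    return trace


def trace_spliced(n, leaf):
    """Access trace when f and g are recursively interleaved block by block."""
    trace = []

    def visit_pair(lo, hi):
        if hi - lo <= leaf:
            trace.extend(range(lo, hi))  # f on this block
            trace.extend(range(lo, hi))  # g on the same block, while it is still cached
            return
        mid = (lo + hi) // 2
        visit_pair(lo, mid)
        visit_pair(mid, hi)

    visit_pair(0, n)
    return trace


def reuse_distances(trace):
    """Distinct other addresses touched between consecutive accesses to the same address."""
    last_seen = {}
    distances = []
    for t, addr in enumerate(trace):
        if addr in last_seen:
            distances.append(len(set(trace[last_seen[addr] + 1:t])))
        last_seen[addr] = t
    return distances


if __name__ == "__main__":
    n, leaf = 16, 4
    print(max(reuse_distances(trace_sequential(n, leaf))))
    print(max(reuse_distances(trace_spliced(n, leaf))))
```

In this toy setting both schedules touch the same elements the same number of times; only the order changes. The interleaved order bounds the reuse distance by the leaf block size rather than by the whole array, which is the effect the abstract attributes to splicing.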

Published in

ACM SIGPLAN Notices, Volume 52, Issue 6 (PLDI '17), June 2017, 708 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3140587

PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2017, 708 pages
ISBN: 9781450349888
DOI: 10.1145/3062341

Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
