Abstract
We present an approach to optimizing cache locality for recursive programs by dynamically splicing (recursively interleaving) the execution of distinct function invocations. Using data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave the execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work with a random work-stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.
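The effect of splicing on reuse distance can be illustrated with a toy model. The sketch below is not the paper's algorithm: it hard-codes the interleaving, whereas the paper derives it at run time from effect annotations and lightweight threads. It assumes two hypothetical divide-and-conquer invocations, `f` and `g`, that touch the same blocks, and it assumes `g` on a block depends only on `f` on that same block. All function names here are illustrative.

```python
def sequential(lo, hi, leaf, trace):
    """Run all of f over [lo, hi), then all of g over [lo, hi)."""
    def walk(name, lo, hi):
        if hi - lo <= leaf:
            trace.append((name, lo // leaf))  # record the block touched
            return
        mid = (lo + hi) // 2
        walk(name, lo, mid)
        walk(name, mid, hi)
    walk("f", lo, hi)
    walk("g", lo, hi)


def spliced(lo, hi, leaf, trace):
    """Recursively interleave f and g so each block is reused immediately.

    Assumption: g on a block only depends on f on the same block.
    """
    if hi - lo <= leaf:
        trace.append(("f", lo // leaf))
        trace.append(("g", lo // leaf))
        return
    mid = (lo + hi) // 2
    spliced(lo, mid, leaf, trace)
    spliced(mid, hi, leaf, trace)


def max_reuse_distance(trace):
    """Largest number of distinct other blocks touched between two
    consecutive accesses to the same block."""
    last_seen = {}
    worst = 0
    for i, (_, block) in enumerate(trace):
        if block in last_seen:
            between = {b for _, b in trace[last_seen[block] + 1:i]}
            worst = max(worst, len(between))
        last_seen[block] = i
    return worst


seq_trace, spl_trace = [], []
sequential(0, 64, 8, seq_trace)   # 8 blocks of 8 elements each
spliced(0, 64, 8, spl_trace)
print(max_reuse_distance(seq_trace), max_reuse_distance(spl_trace))  # prints: 7 0
```

Both schedules perform the same leaf work; only the order differs. In this model the sequential order touches all seven other blocks before a block is reused, while the spliced order reuses each block immediately.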
Cache locality optimization for recursive programs
PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation