skip to main content
research-article

Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management

Published:22 May 2018Publication History
Skip Abstract Section

Abstract

One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, and thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the corunning workloads.

To the best of our knowledge, this is the first time that a methodology provides (1) a theoretical foundation in the above-mentioned cache management mechanisms and (2) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude keeping at the same time the number of arithmetic/addressing instructions to a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem, when a (near)-optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast deduction in the required search space.

References

  1. F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’06). IEEE, 295--305. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. L. Almagor, Keith D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. 2004. Finding effective compilation sequences. ACM SIGPLAN Not. 39, 7, 231--239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Utpal Banerjee. 1993. Linear Equations and Inequalities. Springer, Boston, MA, 49--94.Google ScholarGoogle Scholar
  4. Bin Bao and Chen Ding. 2013. Defensive loop tiling for shared cache. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). IEEE, 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. M. Baskaran, N. Vydyanathan, U. K. R. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2009. Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors. ACM SIGPLAN Not. 44, 4, 219--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2, 1--7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008a. A practical automatic polyhedral parallelizer and locality optimizer. ACM SIGPLAN Not. 43, 6, 101--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Uday Bondhugula, J. Ramanujam, and P. Sadayppan. 2008b. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI’08).Google ScholarGoogle Scholar
  9. Bach D. Bui, Marco Caccamo, Lui Sha, and Joseph Martinez. 2008. Impact of cache partitioning on multi-tasking real time embedded systems. In Proceedings of the 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, 101--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Jichuan Chang and Gurindar S. Sohi. 2014. Cooperative cache partitioning for chip multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing (ICS’07). ACM, 80--81. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Shi-Kuo Chang. 2003. Data Structures and Algorithms. Series on Software Engineering and Knowledge Engineering, Vol. 13. World Scientific.Google ScholarGoogle Scholar
  12. Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2005. ACME: Adaptive compilation made efficient. ACM SIGPLAN Not. 40, 7, 69--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steve Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2006. Exploring the structure of the space of compilation sequences using randomized search algorithms. J. Supercomput. 36, 2, 135--151. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Xiaoning Ding, Kaibo Wang, and Xiaodong Zhang. 2011. ULCC: A user-level facility for optimizing shared cache performance on multicores. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, 103--112. http://dblp.uni-trier.de/db/conf/ppopp/ppopp2011.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Haakon Dybdahl and Per Stenstrm. 2007. An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07). IEEE, 2--12. http://dblp.uni-trier.de/db/conf/hpca/hpca2007.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Haneda, P. M. W. Khnijnenburg, and H. A. G. Wijshoff. 2005. Automatic selection of compiler options using non-parametric inferential statistics. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). IEEE,, 123--132. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. Henretty, A. Rountev, H. Lin, J. Ramanujam, Q. Lu, T. F. Ngai, Y. Chen, C. Alias, U. Bondhugula, P. Sadayappan, and S. Krishnamoorthy. 2009. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 348--357. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Mahmut Kandemir, Sai Prashanth Muralidhara, Sri Hari Krishna Narayanan, Yuanrui Zhang, and Ozcan Ozturk. 2009. Optimizing shared cache behavior of chip multiprocessors. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 505--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Mahmut Kandemir, Taylan Yemliha, SaiPrashanth Muralidhara, Shekhar Srikantaiah, Mary Jane Irwin, and Yuanrui Zhnag. 2010. Cache topology aware computation mapping for multicores. ACM SIGPLAN Not. 45, 6, 74--85. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Dimitris Kaseridis, Jeffrey Stuecheli, and Lizy K. John. 2009. Bank-aware dynamic cache partitioning for multicore architectures. In Proceedings of the International Conference on Parallel Processing (ICPP’09). 18--25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Vasilios Kelefouras, Angeliki Kritikakou, and Costas Goutis. 2015. A methodology for speeding up loop kernels by exploiting the software information and the memory architecture. Comput. Lang. Syst. Structures 41, 21--41. http://dblp.uni-trier.de/db/journals/cl/cl41.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, Article 51, 12 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Hyoseung Kim, Arvind Kandhalu, and Ragunathan Rajkumar. 2013. A coordinated approach for practical OS-level cache management in multi-core real-time systems. In Proceedings of the 25th Euromicro Conference on Real-Time Systems, (ECRTS’13). 80--89. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. P. M. W. Knijnenburg, T. Kisuki, K. Gallivan, and M. F. P. O’Boyle. 2004. The effect of cache models on iterative compilation for combined tiling and unrolling. Concur. Comput. Pract. Exper. 16, 2--3, 247--270. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Prasad Kulkarni, Stephen Hines, Jason Hiser, David Whalley, Jack Davidson, and Douglas Jones. 2004. Fast searches for effective optimization phase sequences. ACM SIGPLAN Not. 39, 6, 171--182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Prasad A. Kulkarni, David B. Whalley, and Gary S. Tyson. 2007. Evaluating heuristic optimization phase order search algorithms. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’07). IEEE, 157--169. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, and Jack W. Davidson. 2009. Practical exhaustive optimization phase order exploration and evaluation. ACM Trans. Archit. Code Optim. 6, 1, Article 1, 36 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 469--480. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Jacob Lidman, Daniel J. Quinlan, Chunhua Liao, and Sally A. McKee. 2012. ROSE: FTTransform-A source-to-source translation framework for exascale fault-tolerance research. In Proceedings of the 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W’12). IEEE, 1--6.Google ScholarGoogle Scholar
  30. Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). IEEE, 367--378. http://dblp.uni-trier.de/db/conf/hpca/hpca2008.html.Google ScholarGoogle Scholar
  31. Jun Liu, Yuanrui Zhang, Wei Ding, and Mahmut T. Kandemir. 2011. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’11). IEEE, 161--170. http://dblp.uni-trier.de/db/conf/cgo/cgo2011.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Antoine Monsifrot, François Bodin, and Rene Quiniou. 2002. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA’02). 41--50. http://dl.acm.org/citation.cfm?id=646053.677574. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Miquel Moret, Francisco J. Cazorla, Alex Ramrez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In HiPEAC. Lecture Notes in Computer Science, Vol. 4917. Springer, 337--352. Retrieved from http://dblp.uni-trier.de/db/conf/hipeac/hipeac2008.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Dimitrios S. Nikolopoulos. 2003. Code and data transformations for improving shared cache performance on SMT processors. In Proceedings of the 5th International Symposium on High Performance Computing (ISHPC’03). 54--69.Google ScholarGoogle ScholarCross RefCross Ref
  35. Eunjung Park, Sameer Kulkarni, and John Cavazos. 2011. An evaluation of different modeling techniques for iterative compilation. In Proceedings of the 14th International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES’11). ACM, 65--74. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. L.-N. Pouchet. 2012. PolyBench/C Benchmark Suite. Retrieved from http://web.cs.ucla.edu/ pouchet/software/polybench/.Google ScholarGoogle Scholar
  37. Rakesh Reddy and Peter Petrov. 2010. Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM Trans. Embed. Comput. Syst. 9, 3, Article 16, 35 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Lakshminarayanan Renganarayanan, DaeGon Kim, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Parameterized tiled loops for free. ACM SIGPLAN Not. 42, 6, 405--414. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Mark Stephenson, Saman Amarasinghe, Martin Martin, and Una-May O’Reilly. 2003. Meta optimization: Improving compiler heuristics with machine learning. ACM SIGPLAN Not. 38, 5, 77--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, Nigel P. Topham, and Bjorn Franke. 2012. Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. In Proceedings of the IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA’12). IEEE, 311--322. http://dblp.uni-trier.de/db/conf/hpca/hpca2012.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. I-Jui Sung, Nasser Anssari, John A. Stratton, and Wen-Mei W. Hwu. 2012. Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. Int. J. Parallel Program. 40, 1, 4--24.Google ScholarGoogle ScholarCross RefCross Ref
  42. David Tam, Reza Azimi, Livio Soares, and Michael Stumm. 2007. Managing shared L2 caches on multicore systems in software. In Proceedings of the Workshop on the Interaction Between Operating Systems and Computer Architecture.Google ScholarGoogle Scholar
  43. Michele Tartara and Stefano Crespi Reghizzi. 2013. Continuous learning of compiler heuristics. ACM Trans. Archit. Code Optim. 9, 4, Article 46, 25 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2001. Automated empirical optimization of software and the ATLAS project. Parallel Comput. 27, 1--2, 3--35.Google ScholarGoogle ScholarCross RefCross Ref
  45. Ying Ye, Richard West, Zhuoqun Cheng, and Ye Li. 2014. COLORIS: A dynamic cache partitioning system using page coloring. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, 381--392. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Chenjie Yu and Peter Petrov. 2010. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In Proceedings of the 47th Design Automation Conference (DAC’10). ACM, 132--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards practical page coloring-based multicore cache management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys’09). ACM, 89--102. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Xing Zhou, Jean-Pierre Giacalone, Marıa Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, 207--218. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!