Abstract
One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, and thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the corunning workloads.
To the best of our knowledge, this is the first time that a methodology provides (1) a theoretical foundation in the above-mentioned cache management mechanisms and (2) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude keeping at the same time the number of arithmetic/addressing instructions to a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem, when a (near)-optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast deduction in the required search space.
- F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’06). IEEE, 295--305. Google Scholar
Digital Library
- L. Almagor, Keith D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. 2004. Finding effective compilation sequences. ACM SIGPLAN Not. 39, 7, 231--239. Google Scholar
Digital Library
- Utpal Banerjee. 1993. Linear Equations and Inequalities. Springer, Boston, MA, 49--94.Google Scholar
- Bin Bao and Chen Ding. 2013. Defensive loop tiling for shared cache. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). IEEE, 1--11. Google Scholar
Digital Library
- M. M. Baskaran, N. Vydyanathan, U. K. R. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2009. Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors. ACM SIGPLAN Not. 44, 4, 219--228. Google Scholar
Digital Library
- Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2, 1--7. Google Scholar
Digital Library
- U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008a. A practical automatic polyhedral parallelizer and locality optimizer. ACM SIGPLAN Not. 43, 6, 101--113. Google Scholar
Digital Library
- Uday Bondhugula, J. Ramanujam, and P. Sadayppan. 2008b. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI’08).Google Scholar
- Bach D. Bui, Marco Caccamo, Lui Sha, and Joseph Martinez. 2008. Impact of cache partitioning on multi-tasking real time embedded systems. In Proceedings of the 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, 101--110. Google Scholar
Digital Library
- Jichuan Chang and Gurindar S. Sohi. 2014. Cooperative cache partitioning for chip multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing (ICS’07). ACM, 80--81. Google Scholar
Digital Library
- Shi-Kuo Chang. 2003. Data Structures and Algorithms. Series on Software Engineering and Knowledge Engineering, Vol. 13. World Scientific.Google Scholar
- Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2005. ACME: Adaptive compilation made efficient. ACM SIGPLAN Not. 40, 7, 69--77. Google Scholar
Digital Library
- Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steve Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2006. Exploring the structure of the space of compilation sequences using randomized search algorithms. J. Supercomput. 36, 2, 135--151. Google Scholar
Digital Library
- Xiaoning Ding, Kaibo Wang, and Xiaodong Zhang. 2011. ULCC: A user-level facility for optimizing shared cache performance on multicores. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, 103--112. http://dblp.uni-trier.de/db/conf/ppopp/ppopp2011.html. Google Scholar
Digital Library
- Haakon Dybdahl and Per Stenstrm. 2007. An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07). IEEE, 2--12. http://dblp.uni-trier.de/db/conf/hpca/hpca2007.html. Google Scholar
Digital Library
- M. Haneda, P. M. W. Khnijnenburg, and H. A. G. Wijshoff. 2005. Automatic selection of compiler options using non-parametric inferential statistics. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). IEEE,, 123--132. Google Scholar
Digital Library
- T. Henretty, A. Rountev, H. Lin, J. Ramanujam, Q. Lu, T. F. Ngai, Y. Chen, C. Alias, U. Bondhugula, P. Sadayappan, and S. Krishnamoorthy. 2009. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 348--357. Google Scholar
Digital Library
- Mahmut Kandemir, Sai Prashanth Muralidhara, Sri Hari Krishna Narayanan, Yuanrui Zhang, and Ozcan Ozturk. 2009. Optimizing shared cache behavior of chip multiprocessors. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 505--516. Google Scholar
Digital Library
- Mahmut Kandemir, Taylan Yemliha, SaiPrashanth Muralidhara, Shekhar Srikantaiah, Mary Jane Irwin, and Yuanrui Zhnag. 2010. Cache topology aware computation mapping for multicores. ACM SIGPLAN Not. 45, 6, 74--85. Google Scholar
Digital Library
- Dimitris Kaseridis, Jeffrey Stuecheli, and Lizy K. John. 2009. Bank-aware dynamic cache partitioning for multicore architectures. In Proceedings of the International Conference on Parallel Processing (ICPP’09). 18--25. Google Scholar
Digital Library
- Vasilios Kelefouras, Angeliki Kritikakou, and Costas Goutis. 2015. A methodology for speeding up loop kernels by exploiting the software information and the memory architecture. Comput. Lang. Syst. Structures 41, 21--41. http://dblp.uni-trier.de/db/journals/cl/cl41.html. Google Scholar
Digital Library
- DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, Article 51, 12 pages. Google Scholar
Digital Library
- Hyoseung Kim, Arvind Kandhalu, and Ragunathan Rajkumar. 2013. A coordinated approach for practical OS-level cache management in multi-core real-time systems. In Proceedings of the 25th Euromicro Conference on Real-Time Systems, (ECRTS’13). 80--89. Google Scholar
Digital Library
- P. M. W. Knijnenburg, T. Kisuki, K. Gallivan, and M. F. P. O’Boyle. 2004. The effect of cache models on iterative compilation for combined tiling and unrolling. Concur. Comput. Pract. Exper. 16, 2--3, 247--270. Google Scholar
Digital Library
- Prasad Kulkarni, Stephen Hines, Jason Hiser, David Whalley, Jack Davidson, and Douglas Jones. 2004. Fast searches for effective optimization phase sequences. ACM SIGPLAN Not. 39, 6, 171--182. Google Scholar
Digital Library
- Prasad A. Kulkarni, David B. Whalley, and Gary S. Tyson. 2007. Evaluating heuristic optimization phase order search algorithms. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’07). IEEE, 157--169. Google Scholar
Digital Library
- Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, and Jack W. Davidson. 2009. Practical exhaustive optimization phase order exploration and evaluation. ACM Trans. Archit. Code Optim. 6, 1, Article 1, 36 pages. Google Scholar
Digital Library
- Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 469--480. Google Scholar
Digital Library
- Jacob Lidman, Daniel J. Quinlan, Chunhua Liao, and Sally A. McKee. 2012. ROSE: FTTransform-A source-to-source translation framework for exascale fault-tolerance research. In Proceedings of the 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W’12). IEEE, 1--6.Google Scholar
- Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). IEEE, 367--378. http://dblp.uni-trier.de/db/conf/hpca/hpca2008.html.Google Scholar
- Jun Liu, Yuanrui Zhang, Wei Ding, and Mahmut T. Kandemir. 2011. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’11). IEEE, 161--170. http://dblp.uni-trier.de/db/conf/cgo/cgo2011.html. Google Scholar
Digital Library
- Antoine Monsifrot, François Bodin, and Rene Quiniou. 2002. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA’02). 41--50. http://dl.acm.org/citation.cfm?id=646053.677574. Google Scholar
Digital Library
- Miquel Moret, Francisco J. Cazorla, Alex Ramrez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In HiPEAC. Lecture Notes in Computer Science, Vol. 4917. Springer, 337--352. Retrieved from http://dblp.uni-trier.de/db/conf/hipeac/hipeac2008.html. Google Scholar
Digital Library
- Dimitrios S. Nikolopoulos. 2003. Code and data transformations for improving shared cache performance on SMT processors. In Proceedings of the 5th International Symposium on High Performance Computing (ISHPC’03). 54--69.Google Scholar
Cross Ref
- Eunjung Park, Sameer Kulkarni, and John Cavazos. 2011. An evaluation of different modeling techniques for iterative compilation. In Proceedings of the 14th International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES’11). ACM, 65--74. Google Scholar
Digital Library
- L.-N. Pouchet. 2012. PolyBench/C Benchmark Suite. Retrieved from http://web.cs.ucla.edu/ pouchet/software/polybench/.Google Scholar
- Rakesh Reddy and Peter Petrov. 2010. Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM Trans. Embed. Comput. Syst. 9, 3, Article 16, 35 pages. Google Scholar
Digital Library
- Lakshminarayanan Renganarayanan, DaeGon Kim, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Parameterized tiled loops for free. ACM SIGPLAN Not. 42, 6, 405--414. Google Scholar
Digital Library
- Mark Stephenson, Saman Amarasinghe, Martin Martin, and Una-May O’Reilly. 2003. Meta optimization: Improving compiler heuristics with machine learning. ACM SIGPLAN Not. 38, 5, 77--90. Google Scholar
Digital Library
- Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, Nigel P. Topham, and Bjorn Franke. 2012. Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. In Proceedings of the IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA’12). IEEE, 311--322. http://dblp.uni-trier.de/db/conf/hpca/hpca2012.html. Google Scholar
Digital Library
- I-Jui Sung, Nasser Anssari, John A. Stratton, and Wen-Mei W. Hwu. 2012. Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. Int. J. Parallel Program. 40, 1, 4--24.Google Scholar
Cross Ref
- David Tam, Reza Azimi, Livio Soares, and Michael Stumm. 2007. Managing shared L2 caches on multicore systems in software. In Proceedings of the Workshop on the Interaction Between Operating Systems and Computer Architecture.Google Scholar
- Michele Tartara and Stefano Crespi Reghizzi. 2013. Continuous learning of compiler heuristics. ACM Trans. Archit. Code Optim. 9, 4, Article 46, 25 pages. Google Scholar
Digital Library
- R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2001. Automated empirical optimization of software and the ATLAS project. Parallel Comput. 27, 1--2, 3--35.Google Scholar
Cross Ref
- Ying Ye, Richard West, Zhuoqun Cheng, and Ye Li. 2014. COLORIS: A dynamic cache partitioning system using page coloring. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, 381--392. Google Scholar
Digital Library
- Chenjie Yu and Peter Petrov. 2010. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In Proceedings of the 47th Design Automation Conference (DAC’10). ACM, 132--137. Google Scholar
Digital Library
- Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards practical page coloring-based multicore cache management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys’09). ACM, 89--102. Google Scholar
Digital Library
- Xing Zhou, Jean-Pierre Giacalone, Marıa Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, 207--218. Google Scholar
Digital Library
Index Terms
Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management
Recommendations
Combining Process-Based Cache Partitioning and Pollute Region Isolation to Improve Shared Last Level Cache Utilization on Multicore Systems
TRUSTCOM '13: Proceedings of the 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and CommunicationsShared last level cache has been widely used in modern multicore processors. However, uncontrolled cache sharing on multicore leads to more serious cache pollution than that on single-core processor. A process with weak locality can evict strong ...
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping
ICPADS '00: Proceedings of the Seventh International Conference on Parallel and Distributed SystemsThis paper presents an algorithm to reduce cache conflicts and improve cache localities. The proposed algorithm analyzes locality reference space for each reference pattern, partitions the multi-level cache into several parts with different size, and ...
Defensive loop tiling for shared cache
CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy especially the last level cache (LLC) is shared. The available cache space in shared ...






Comments