Abstract
Optimizing compilers implement program transformation strategies aimed at reducing data movement to or from main memory by exploiting the data-cache hierarchy. However, rather than attempting to minimize the number of cache misses directly, they use very approximate cost models, because precise compile-time models of misses in hierarchical caches are lacking. The current state of practice for cache miss analysis is based on accurate simulation. However, simulation requires time proportional to the dataset/problem size, as well as to the number of distinct cache configurations to be evaluated.
This paper takes a fundamentally different approach by focusing on polyhedral programs with static control flow. Instead of relying on costly simulation, it develops a closed-form solution for modeling misses in a set-associative cache hierarchy. This solution can enable the choice of program transformations at compile time to optimize cache misses. A tool implementing the approach has been developed and used to validate the framework.
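As a rough illustration of the contrast drawn above, and not the paper's model, the sketch below counts misses for a single sequential pass over an array in two ways: by replaying every access through a small set-associative LRU simulator, and by evaluating a closed-form expression. The cache parameters, the LRU policy, the function names, and the assumptions that the array starts at address 0 and that the element size divides the line size are all chosen here for illustration only.

```python
from collections import OrderedDict


def simulate_misses(addresses, line_size=64, num_sets=64, ways=8):
    """Count misses of a set-associative LRU cache by replaying every access.

    Running time grows with the length of the address trace.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size          # cache line holding this byte
        cache_set = sets[line % num_sets]  # set the line maps to
        if line in cache_set:
            cache_set.move_to_end(line)    # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses


def scan_trace(n, elem_size=8):
    """Byte addresses touched by one sequential pass over n elements."""
    return (i * elem_size for i in range(n))


def closed_form_scan_misses(n, elem_size=8, line_size=64):
    """Misses for the same pass, computed in constant time.

    Assumes the array starts at address 0, elem_size divides line_size,
    and the cache is initially empty, so each distinct line misses once.
    """
    return (n * elem_size + line_size - 1) // line_size
```

For this access pattern the two functions agree, but the simulator must visit every access and be rerun for each cache configuration, while the closed form is evaluated once per problem size. The analysis described in the abstract targets far more general affine loop nests than this single scan.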
Index Terms
Analytical modeling of cache behavior for affine programs