Abstract
Locality analysis is important because accessing memory is much slower than computing. Compile-time locality analysis can provide detailed program-level feedback to compilers or runtime systems faster than trace-based locality analysis.
In this paper, we describe a new approach to locality analysis based on static parallel sampling. A compiler analyzes loop-based code and generates sampler code, which is then run to measure locality. Our approach can predict precise cache-line-granularity miss ratio curves for complex loops with non-linear array references and even branches. The precision and overhead of static sampling are evaluated using PolyBench and a bit-reversal loop. Our results show that by randomly sampling 2% of loop iterations, a compiler can construct miss ratio curves nearly identical to those from trace-based analysis. Sampling 0.5% and 1% of iterations achieves good precision and efficiency, taking on average 0.6% and 1% of the tracing time, respectively. Our analysis can also be parallelized. The analysis may assist program optimization techniques such as tiling, program co-location, and cache hint selection, and may help to analyze write locality and parallel locality.
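To make the idea concrete: a miss ratio curve is derived from reuse (LRU stack) distances, and sampling a fraction of accesses approximates the full distance distribution. The sketch below is our illustration, not the paper's generated sampler (which derives reuse distances statically rather than by running the loop); the function names `reuse_distances`, `sampled_reuse_distances`, and `miss_ratio_curve` are ours.

```python
import random

def reuse_distances(trace):
    """LRU stack distance of every access: the number of distinct
    addresses touched since the last access to the same address
    (None for first-time, i.e. cold, accesses)."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            dists.append(len(stack) - 1 - i)
            stack.pop(i)
        else:
            dists.append(None)
        stack.append(addr)
    return dists

def sampled_reuse_distances(trace, rate, rng):
    """Estimate the distance distribution from a random sample of
    accesses: for each sampled access, scan forward in the full
    stream and count distinct addresses until its next reuse."""
    dists = []
    for i, addr in enumerate(trace):
        if rng.random() >= rate:
            continue
        seen = set()
        for later in trace[i + 1:]:
            if later == addr:
                dists.append(len(seen))
                break
            seen.add(later)
        else:
            dists.append(None)  # never reused: counts as a cold miss
    return dists

def miss_ratio_curve(dists, max_size):
    """Miss ratio of a fully associative LRU cache of 0..max_size
    lines: an access misses iff its reuse distance is at least the
    cache size, or it is a cold access."""
    n = len(dists)
    return [sum(1 for d in dists if d is None or d >= c) / n
            for c in range(max_size + 1)]

# Toy access stream at cache-line granularity.
trace = [("a", i % 4) for i in range(16)] + [("b", i % 2) for i in range(8)]

exact = miss_ratio_curve(reuse_distances(trace), 6)
approx = miss_ratio_curve(
    sampled_reuse_distances(trace, rate=0.5, rng=random.Random(0)), 6)
```

With a uniform random sample of accesses, each sampled distance is drawn from the same distribution as the full trace, which is why even small sampling rates can reproduce the curve closely.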
Locality analysis through static parallel sampling
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation