research-article
Public Access
Artifacts Available
Artifacts Evaluated & Functional

Locality analysis through static parallel sampling

Published: 11 June 2018

Abstract

Locality analysis is important since accessing memory is much slower than computing. Compile-time locality analysis can provide detailed program-level feedback for compilers or runtime systems faster than trace-based locality analysis.

In this paper, we describe a new approach to locality analysis based on static parallel sampling. A compiler analyzes loop-based code and generates sampler code, which is run to measure locality. Our approach can predict precise cache-line-granularity miss ratio curves for complex loops with non-linear array references and even branches. The precision and overhead of static sampling are evaluated using PolyBench and a bit-reversal loop. Our results show that by randomly sampling 2% of loop iterations, a compiler can construct miss ratio curves nearly identical to those from trace-based analysis. Sampling 0.5% or 1% of iterations achieves good precision and efficiency, taking on average 0.6% or 1% of the time of tracing, respectively. Our analysis can also be parallelized. The analysis may assist program optimization techniques such as tiling, program co-location, and cache hint selection, and help to analyze write locality and parallel locality.
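The measurement at the heart of this approach can be illustrated with a small, simplified sketch. The sketch below is a trace-driven stand-in, not the paper's compiler-generated static sampler: it randomly samples accesses in a cache-line trace, measures each sampled access's reuse distance (the number of distinct lines touched before its next reuse), and derives a miss ratio curve for an LRU cache of each candidate size. All function names here are illustrative assumptions, not the authors' API.

```python
import random

def reuse_distances(trace, sample_rate=0.02, seed=0):
    """For a randomly sampled subset of accesses in a cache-line trace,
    measure the number of distinct lines touched before the next reuse."""
    rng = random.Random(seed)
    pending = {}       # sampled line -> distinct lines seen since sampling
    distances = []
    for line in trace:
        for start, seen in list(pending.items()):
            if line == start:            # reuse found: record its distance
                distances.append(len(seen))
                del pending[start]
            else:
                seen.add(line)
        if rng.random() < sample_rate:   # randomly sample this access
            pending.setdefault(line, set())
    return distances

def miss_ratio_curve(distances, cache_sizes):
    """For a fully associative LRU cache of c lines, a reuse misses
    exactly when its reuse distance is at least c."""
    n = len(distances)
    return {c: sum(d >= c for d in distances) / n for c in cache_sizes}
```

For example, a loop streaming over three cache lines repeatedly (trace `[0, 1, 2, 0, 1, 2, ...]`) yields reuse distance 2 for every reuse, so the predicted miss ratio drops to zero once the cache holds three lines. The paper's contribution is doing this sampling statically and in parallel, without running the full trace.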


Supplemental Material

p557-chen.webm

