Abstract
As multi-core processors become commonplace and cloud computing gains acceptance, more applications run in a shared-cache environment. Cache sharing depends on a concept called footprint, which depends on all cache accesses, not just cache misses. Previous work has recognized the importance of footprint but has not provided a method for measuring it accurately, mainly because a complete measurement requires counting data accesses in all execution windows, which takes time quadratic in the length of the trace. The paper first presents an algorithm, efficient enough for off-line use, that approximately measures the footprint with a guaranteed precision. The cost of the analysis can be adjusted by changing the precision. The paper then presents a composable model: for a set of programs, the model uses the all-window footprint of each program to predict its cache interference with other programs, without running the programs together. The paper evaluates the efficiency of all-window profiling using the SPEC 2000 benchmarks and compares the footprint interference model with a miss-rate-based model and with exhaustive testing.
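To make the quadratic cost concrete, the following is a minimal sketch of the naive exhaustive measurement that the abstract describes as too expensive. It is not the paper's approximation algorithm. It assumes that the footprint of a window is the number of distinct data items accessed in that window, and it summarizes the result as the average footprint per window length; the function name `all_window_footprints` and the choice of an average as the summary are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def all_window_footprints(trace):
    """Brute-force all-window footprint (hypothetical helper, for illustration).

    Assumption: the footprint of a window is the number of distinct items
    accessed in it. Returns {window_length: average footprint over all
    windows of that length}. Enumerates all n*(n+1)/2 windows, so the
    running time is quadratic in the trace length.
    """
    n = len(trace)
    totals = defaultdict(int)  # window length -> sum of footprints
    for start in range(n):
        seen = set()  # distinct items in trace[start..end]
        for end in range(start, n):
            seen.add(trace[end])
            totals[end - start + 1] += len(seen)
    # A trace of length n contains n - length + 1 windows of each length.
    return {length: totals[length] / (n - length + 1)
            for length in sorted(totals)}


if __name__ == "__main__":
    # Each character stands for one accessed data item.
    for length, avg in all_window_footprints(list("abaa")).items():
        print(length, avg)
```

The nested loop visits every window exactly once, which is why the cost grows with the square of the trace length and why an approximate method with adjustable precision is attractive for long traces.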
All-window profiling and composable models of cache sharing
PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming