ABSTRACT
Modern microprocessor designs continue to achieve impressive performance gains through increasing clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, corresponding improvements in memory technology have not been realized, resulting in latencies of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed the current memory-hierarchy approach to its limit.

Traditional approaches to memory-hierarchy management have not yielded satisfactory results. Hardware solutions require more power and energy than desired and do not scale well. Compiler solutions tend to miss too many optimization opportunities because of limited compile-time knowledge of run-time behavior. This paper explores a hybrid approach that brings the static knowledge obtained by the compiler into the dynamic decision making of the micro-architecture. We propose a memory-hierarchy design based on working sets that uses compile-time annotations describing the working set of each memory operation to guide cache placement decisions.

Our experiments show that a working-set-based memory hierarchy can significantly reduce the miss rate of memory-intensive tiled kernels by limiting cross interference. The working-set-based hierarchy allows the compiler to tile many loops without concern for cross interference in the cache, making tile-size selection easier. In addition, the compiler can more easily tailor tile choices to the separate needs of different working sets.
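The setting can be illustrated with a blocked kernel. The sketch below is a minimal example under assumptions of ours, not the paper's notation: a tiled matrix multiply in C in which comments label the working set each reference belongs to. With such labels, the reused tile of b could be kept apart from the streaming accesses to a and c, and the tile sizes TJ and TK could be chosen for reuse rather than to avoid conflict misses. The function name, tile sizes, and working-set labels are all illustrative.

/* Hypothetical sketch (not the paper's annotation scheme): a tiled
 * matrix multiply whose references are labeled by working set, so a
 * working-set-aware cache could place the reused b tile separately
 * from the streamed a and c data and avoid cross interference.
 * Assumes c has been zero-initialized by the caller. */
#include <stddef.h>

#define N   512
#define TJ  64   /* tile size in j: chosen for reuse, not conflict avoidance */
#define TK  64   /* tile size in k */

void matmul_tiled(double c[N][N], const double a[N][N], const double b[N][N])
{
    for (size_t jj = 0; jj < N; jj += TJ)
        for (size_t kk = 0; kk < N; kk += TK)
            for (size_t i = 0; i < N; i++)
                for (size_t k = kk; k < kk + TK; k++) {
                    double r = a[i][k];          /* working set A: one element, reused across j */
                    for (size_t j = jj; j < jj + TJ; j++)
                        c[i][j] += r * b[k][j];  /* working set B: TK x TJ tile of b, reused
                                                    across all i; working set C: row slice of c,
                                                    streamed once per (jj, kk) block */
                }
}

In a conventional cache, the streamed rows of a and c can evict lines of the b tile, so tile sizes are often shrunk or skewed to dodge such conflicts; if each working set is given its own placement, as the paper proposes, that concern disappears and TJ and TK can be sized to the reuse of b alone.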