DOI: 10.1145/1062261.1062304
Article

A case for a working-set-based memory hierarchy

Published: 04 May 2005

ABSTRACT

Modern microprocessor designs continue to obtain impressive performance gains through increasing clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, corresponding improvements in memory design technology have not been realized, resulting in latencies of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed the current memory-hierarchy approach to its limit.

Traditional approaches to memory-hierarchy management have not yielded satisfactory results. Hardware solutions require more power and energy than desired and do not scale well. Compiler solutions tend to miss too many optimization opportunities because of limited compile-time knowledge of run-time behavior. This paper explores a different approach that combines both approaches by making use of the static knowledge obtained by the compiler in the dynamic decision making of the micro-architecture. We propose a memory-hierarchy design based on working sets that uses compile-time annotations regarding the working set of memory operations to guide cache placement decisions.

Our experiments show that a working-set-based memory hierarchy can significantly reduce the miss rate for memory-intensive tiled kernels by limiting cross interference. The working-set-based memory hierarchy allows the compiler to tile many loops without concern for cross interference in the cache, making tile size choice easier. In addition, the compiler can more easily tailor tile choices to the separate needs of different working sets.
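
To make "cross interference" concrete, the following toy simulation may help. It is only an illustrative sketch and not the design evaluated in the paper: the direct-mapped cache model, the function name `count_misses`, the `num_partitions` parameter, and the idea of mapping each working-set id to its own region of the cache are all assumptions made for this example. Two arrays whose base addresses differ by exactly the cache size are accessed alternately, first in a conventional shared cache and then in a cache split by working-set id.

```python
def count_misses(trace, num_lines, line_size, num_partitions=1):
    """Count misses in a toy direct-mapped cache (hypothetical model).

    trace: iterable of (working_set_id, byte_address) pairs.
    num_partitions: 1 models a conventional shared cache; a larger value
    gives each working-set id its own equal-sized region (an assumption
    for illustration, not the paper's mechanism).
    """
    lines_per_part = num_lines // num_partitions
    tags = [None] * num_lines  # block number currently held by each line
    misses = 0
    for ws_id, addr in trace:
        block = addr // line_size
        # Pick the region from the working-set id, then index within it.
        slot = (ws_id % num_partitions) * lines_per_part + block % lines_per_part
        if tags[slot] != block:
            misses += 1
            tags[slot] = block
    return misses


NUM_LINES, LINE_SIZE, ELEM_SIZE, N = 64, 64, 8, 256
A_BASE = 0
B_BASE = NUM_LINES * LINE_SIZE  # exactly one cache size away from A_BASE

# Alternate accesses to a[i] and b[i]; the ids 0 and 1 stand in for the
# compile-time working-set annotations.
trace = []
for i in range(N):
    trace.append((0, A_BASE + ELEM_SIZE * i))
    trace.append((1, B_BASE + ELEM_SIZE * i))

shared = count_misses(trace, NUM_LINES, LINE_SIZE, num_partitions=1)
split = count_misses(trace, NUM_LINES, LINE_SIZE, num_partitions=2)
print("shared cache misses:", shared)
print("partitioned cache misses:", split)
```

In this contrived trace the shared cache evicts a line of one array every time the other array is touched, while the partitioned cache only incurs the compulsory miss for each line. Real kernels and the paper's actual placement policy will behave differently; the sketch only shows why separating working sets can remove conflicts between them.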

Published in

CF '05: Proceedings of the 2nd conference on Computing frontiers
May 2005
467 pages
ISBN: 1595930191
DOI: 10.1145/1062261

Copyright © 2005 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall acceptance rate: 216 of 614 submissions, 35%