research-article
Public Access

Whirlpool: Improving Dynamic Cache Management with Static Data Classification

Published: 25 March 2016

Abstract

Cache hierarchies are increasingly non-uniform and difficult to manage. Several techniques, such as scratchpads or reuse hints, use static information about how programs access data to manage the memory hierarchy. Static techniques are effective on regular programs, but because they set fixed policies, they are vulnerable to changes in program behavior or available cache space. Instead, most systems rely on dynamic caching policies that adapt to observed program behavior. Unfortunately, dynamic policies spend significant resources trying to learn how programs use memory, and yet they often perform worse than a static policy. We present Whirlpool, a novel approach that combines static information with dynamic policies to reap the benefits of each. Whirlpool statically classifies data into pools based on how the program uses memory. Whirlpool then uses dynamic policies to tune the cache to each pool. Hence, rather than setting policies statically, Whirlpool uses static analysis to guide dynamic policies. We present both an API that lets programmers specify pools manually and a profiling tool that discovers pools automatically in unmodified binaries.

We evaluate Whirlpool on a state-of-the-art NUCA cache. Whirlpool significantly outperforms prior approaches: on sequential programs, Whirlpool improves performance by up to 38% and reduces data movement energy by up to 53%; on parallel programs, Whirlpool improves performance by up to 67% and reduces data movement energy by up to 2.6x.
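The abstract mentions an API through which programmers specify pools manually, but this excerpt does not show its interface. As an illustration only, here is a minimal sketch of what a pool-tagging allocator could look like: each allocation is prefixed with a header recording its pool id, so the runtime (or hardware) can later apply a per-pool caching policy. The names `pool_create`, `pool_malloc`, and `pool_of` are assumptions for this sketch, not necessarily the paper's actual API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of a manual pool API: allocations are tagged
 * with a pool id so a dynamic policy can be tuned per pool.
 * This is NOT the paper's implementation, just an illustration. */

#define MAX_POOLS 16

typedef int pool_t;

static int next_pool = 0;

/* Allocation header; the union with max_align_t keeps the user
 * data that follows the header suitably aligned for any type. */
typedef union {
    pool_t pool;
    max_align_t align_;
} alloc_hdr;

/* Hand out a fresh pool id. A real runtime would also register
 * the pool with the cache-management hardware here. */
pool_t pool_create(void) {
    return next_pool < MAX_POOLS ? next_pool++ : -1;
}

/* Allocate `size` bytes tagged with `pool`. */
void *pool_malloc(pool_t pool, size_t size) {
    alloc_hdr *h = malloc(sizeof(alloc_hdr) + size);
    if (!h) return NULL;
    h->pool = pool;
    return h + 1; /* user data follows the header */
}

/* Recover the pool id of an allocation made with pool_malloc. */
pool_t pool_of(void *p) {
    return ((alloc_hdr *)p - 1)->pool;
}

void pool_free(void *p) {
    free((alloc_hdr *)p - 1);
}
```

A profiler-driven variant would call the same allocator, but choose the pool id from the allocation site rather than from an explicit programmer annotation.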



Published in

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16), April 2016, 774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte
Program Chair: Yuanyuan Zhou

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
