Abstract
Cache hierarchies are increasingly non-uniform and difficult to manage. Several techniques, such as scratchpads or reuse hints, use static information about how programs access data to manage the memory hierarchy. Static techniques are effective on regular programs, but because they set fixed policies, they are vulnerable to changes in program behavior or available cache space. Instead, most systems rely on dynamic caching policies that adapt to observed program behavior. Unfortunately, dynamic policies spend significant resources trying to learn how programs use memory, and yet they often perform worse than a static policy. We present Whirlpool, a novel approach that combines static information with dynamic policies to reap the benefits of each. Whirlpool statically classifies data into pools based on how the program uses memory. Whirlpool then uses dynamic policies to tune the cache to each pool. Hence, rather than setting policies statically, Whirlpool uses static analysis to guide dynamic policies. We present both an API that lets programmers specify pools manually and a profiling tool that discovers pools automatically in unmodified binaries.
We evaluate Whirlpool on a state-of-the-art NUCA cache. Whirlpool significantly outperforms prior approaches: on sequential programs, Whirlpool improves performance by up to 38% and reduces data movement energy by up to 53%; on parallel programs, Whirlpool improves performance by up to 67% and reduces data movement energy by up to 2.6x.
Whirlpool: Improving Dynamic Cache Management with Static Data Classification. In ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems.