Abstract
Data prefetching and cache replacement algorithms have been studied intensively in the design of high-performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not distinguish between demand and prefetch requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests, since the data prefetcher does not generate a PC value; PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement algorithms. KPC cache management makes three novel contributions. First, a prefetcher approximates the future use distance of each prefetch request based on its prediction confidence. Second, a simple replacement policy based on global hysteresis provides performance similar to or better than current state-of-the-art PC-based prediction. Third, KPC integrates prefetching and replacement into a whole system that is greater than the sum of its parts: information from the prefetcher improves the replacement policy and vice versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC- and non-PC-based schemes. Our evaluation shows that KPC provides 8% better performance than the best combination of existing prefetcher and replacement policy for multi-core workloads.
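The first two contributions, confidence-directed prefetch insertion and a global-hysteresis replacement policy, can be illustrated with a small sketch. This is a hypothetical RRIP-style rendering, not the paper's actual KPC hardware; the function names, thresholds, and counter widths are invented for illustration.

```python
# Hypothetical sketch (not the paper's KPC implementation): a prefetch's
# confidence selects its insertion re-reference value (RRIP-style), so
# low-confidence prefetches are inserted near eviction, while a single
# global hysteresis counter steers demand insertions.

MAX_RRPV = 3  # 2-bit re-reference prediction values, as in RRIP

def prefetch_insertion_rrpv(confidence: float) -> int:
    """Map prefetch confidence to an insertion position: high confidence
    implies a short predicted use distance (low RRPV)."""
    if confidence >= 0.75:
        return 0               # likely reused soon: insert near MRU
    elif confidence >= 0.5:
        return MAX_RRPV - 1    # uncertain: insert at distant reuse
    else:
        return MAX_RRPV        # low confidence: first eviction candidate

class GlobalHysteresis:
    """One saturating counter shared by all sets, trained up when inserted
    lines are reused and down when they are evicted untouched."""
    def __init__(self, bits: int = 4):
        self.max = (1 << bits) - 1
        self.count = self.max // 2

    def train(self, line_was_reused: bool) -> None:
        if line_was_reused:
            self.count = min(self.max, self.count + 1)
        else:
            self.count = max(0, self.count - 1)

    def demand_insertion_rrpv(self) -> int:
        # Counter above midpoint: recent insertions tend to be reused,
        # so predict near-term reuse for new demand lines.
        return 0 if self.count > self.max // 2 else MAX_RRPV - 1
```

The design point being illustrated is the feedback loop: the prefetcher's confidence informs replacement decisions, and observed reuse (or lack of it) trains the shared counter, so each mechanism benefits from the other's information.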
- Standard Performance Evaluation Corporation. CPU2006 Benchmark Suite. http://www.spec.org/cpu2006/.
- J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 176--186. IEEE, 1991.
- R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, and A. G. Gray. MLPACK: A scalable C++ machine learning library. Journal of Machine Learning Research, 14:801--805, 2013.
- N. D. Enright Jerger, E. L. Hill, and M. H. Lipasti. Friendly fire: Understanding the effects of multiprocessor prefetches. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 177--188, 2006.
- H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), pages 365--376. IEEE, 2011.
- V. V. Fedorov, S. Qiu, A. L. Reddy, and P. V. Gratz. ARI: Adaptive LLC-memory traffic management. ACM Transactions on Architecture and Code Optimization (TACO), 10(4):46, 2013.
- M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 37--48. ACM, 2012.
- F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
- R. Hegde. Optimizing application performance on Intel Core microarchitecture using hardware-implemented prefetchers. Intel Software Network, 2008.
- Y. Ishii, M. Inaba, and K. Hiraki. Access map pattern matching for high performance data cache prefetch. Journal of Instruction-Level Parallelism, 13:1--24, 2011.
- Y. Ishii, M. Inaba, and K. Hiraki. Unified memory optimizing architecture: Memory subsystem control with a unified predictor. In Proceedings of the 26th ACM International Conference on Supercomputing, pages 267--278. ACM, 2012.
- A. Jain and C. Lin. Back to the future: Leveraging Belady's algorithm for improved cache replacement. In Proceedings of the 43rd Annual International Symposium on Computer Architecture (ISCA), pages 78--89. IEEE, 2016.
- A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely Jr., and J. Emer. Adaptive insertion policies for managing shared caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 208--219. ACM, 2008.
- A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), pages 60--71. ACM, 2010.
- D. A. Jiménez. Insertion and promotion for tree-based PseudoLRU last-level caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 284--296. ACM, 2013.
- D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz, and D. Jiménez. B-Fetch: Branch prediction directed prefetching for chip-multiprocessors. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 623--634. IEEE Computer Society, 2014.
- S. Khan, Y. Tian, and D. A. Jiménez. Sampling dead block prediction for last-level caches. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 175--186. IEEE, 2010.
- S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutlu, and D. A. Jiménez. Improving cache performance by exploiting read-write disparity. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), pages 452--463. IEEE, 2014.
- J. Kim, S. H. Pugsley, P. V. Gratz, A. N. Reddy, C. Wilkerson, and Z. Chishti. Path confidence based lookahead prefetching. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.
- A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA), pages 144--154. IEEE, 2001.
- M. Li, J. Tan, Y. Wang, L. Zhang, and V. Salapura. SparkBench: A comprehensive benchmarking suite for in-memory data analytic platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers, page 53. ACM, 2015.
- H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 222--233. IEEE Computer Society, 2008.
- P. Michaud. A best-offset prefetcher. In Proceedings of the 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2016.
- E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for accurate and efficient simulation. In ACM SIGMETRICS Performance Evaluation Review, volume 31, pages 318--319. ACM, 2003.
- S. H. Pugsley, A. R. Alameldeen, C. Wilkerson, and H. Kim. The 2nd Data Prefetching Championship (DPC-2). http://comparch-conf.gatech.edu/dpc2/.
- S. H. Pugsley, Z. Chishti, C. Wilkerson, P.-f. Chuang, R. L. Scott, A. Jaleel, S.-L. Lu, K. Chow, and R. Balasubramonian. Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers. In Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 626--637. IEEE, 2014.
- M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), pages 381--391. ACM, 2007.
- V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry. The evicted-address filter: A unified mechanism to address both cache pollution and thrashing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 355--366. ACM, 2012.
- V. Seshadri, S. Yedkar, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry. Mitigating prefetcher-caused pollution using informed caching policies for prefetched blocks. ACM Transactions on Architecture and Code Optimization (TACO), 11(4):51, 2015.
- M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti. Efficiently prefetching complex address patterns. In Proceedings of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2015.
- S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial memory streaming. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), pages 252--263. IEEE Computer Society, 2006.
- S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In Proceedings of the 13th IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 63--74. IEEE, 2007.
- E. Teran, Y. Tian, Z. Wang, and D. A. Jiménez. Minimal disturbance placement and promotion. In Proceedings of the 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 201--211. IEEE, 2016.
- E. Teran, Z. Wang, and D. A. Jiménez. Perceptron learning for reuse prediction. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1--12. IEEE, 2016.
- J.-Y. Won, P. Gratz, S. Shakkottai, and J. Hu. Having your cake and eating it too: Energy savings without performance loss through resource sharing driven power management. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 255--260. IEEE, 2015.
- C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 430--441. ACM, 2011.
- C.-J. Wu, A. Jaleel, M. Martonosi, S. C. Steely Jr., and J. Emer. PACMan: Prefetch-aware cache management for high performance caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 442--453. ACM, 2011.
- W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. SIGARCH Computer Architecture News, 23:20--24, March 1995.
Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy. In ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.