research-article

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy

Published: 04 April 2017

Abstract

Data prefetching and cache replacement algorithms have been intensively studied in the design of high-performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not distinguish between demand and prefetch requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests, since the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC cache management makes three novel contributions. First, it introduces a prefetcher that approximates the future use distance of prefetch requests based on its prediction confidence. Second, it introduces a simple replacement policy that uses global hysteresis and provides similar or better performance than current state-of-the-art PC-based prediction. Third, KPC integrates prefetching and replacement policy into a whole system which is greater than the sum of its parts: information from the prefetcher is used to improve the performance of the replacement policy and vice versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC-based and non-PC-based schemes. Our evaluation shows that KPC provides 8% better performance than the best combination of existing prefetcher and replacement policy for multi-core workloads.
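As a rough illustration of the first contribution, the sketch below shows one way a prefetch confidence could be turned into a predicted reuse distance at cache fill time. It is not the paper's implementation. The 2-bit re-reference prediction value (RRPV), the thresholds `hi` and `lo`, and the function names `insertion_rrpv` and `pick_victim` are all assumptions made for this example, and the victim selection follows the familiar RRIP-style aging loop rather than anything stated in the abstract.

```python
# Hypothetical sketch, not the KPC design: confidence-directed insertion
# on top of an RRIP-style replacement state.

RRPV_MAX = 3  # assumed 2-bit re-reference prediction value


def insertion_rrpv(confidence: float, hi: float = 0.75, lo: float = 0.25) -> int:
    """Map a prefetch confidence in [0, 1] to an insertion RRPV.

    A lower RRPV means the block is predicted to be reused sooner.
    The thresholds `hi` and `lo` are illustrative assumptions.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= hi:
        return 0              # likely used soon: protect from eviction
    if confidence >= lo:
        return RRPV_MAX - 1   # uncertain: predict a long reuse distance
    return RRPV_MAX           # unlikely to be used: evict first


def pick_victim(rrpvs: list) -> int:
    """Return the way to evict, aging the set in place until one way reaches RRPV_MAX."""
    if not rrpvs:
        raise ValueError("cache set must have at least one way")
    while RRPV_MAX not in rrpvs:
        for way in range(len(rrpvs)):
            rrpvs[way] += 1
    return rrpvs.index(RRPV_MAX)
```

With these assumed thresholds, a prefetch issued at confidence 0.9 is inserted as if it will be reused soon, while one issued at confidence 0.1 becomes the first eviction candidate in its set. No PC value is needed for either decision.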



  • Published in

    ACM SIGPLAN Notices, Volume 52, Issue 4
    ASPLOS '17
    April 2017
    811 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3093336

    ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
    April 2017
    856 pages
    ISBN: 9781450344654
    DOI: 10.1145/3037697

    Copyright © 2017 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States
