Research Article

REAL: REquest Arbitration in Last Level Caches

Published: 15 November 2019

Abstract

Shared last level caches (LLCs) in multicore systems-on-chip are subject to significant contention for limited bandwidth, creating performance bottlenecks that make the issue a first-order concern in modern multiprocessor designs. Although shared cache space partitioning has been studied extensively, the problem of cache bandwidth partitioning has not received sufficient attention. We demonstrate the occurrence of such contention and its impact on overall system performance. To address the issue, we perform detailed simulations to study the impact of different parameters and propose a novel cache bandwidth partitioning technique, called REAL, that arbitrates among cache access requests originating from different processor cores. It monitors LLC access patterns to dynamically assign a priority value to each core. Experimental results on different mixes of benchmarks show up to 2.13× overall system speedup over baseline policies, with minimal impact on energy.
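The core idea of monitoring per-core LLC access behavior and arbitrating pending requests by a dynamically assigned priority can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual REAL policy: the class name, the miss-rate priority heuristic, and the queue structure are all assumptions made for the example.

```python
from collections import deque

class PriorityArbiter:
    """Hypothetical sketch of priority-based LLC request arbitration.

    REAL derives per-core priorities from monitored LLC access
    patterns; the miss-rate heuristic below is an illustrative
    assumption, not the algorithm described in the paper.
    """

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]  # per-core pending requests
        self.accesses = [0] * num_cores  # monitored LLC accesses per core
        self.misses = [0] * num_cores    # monitored LLC misses per core

    def record_access(self, core, was_miss):
        """Update the per-core access-pattern counters."""
        self.accesses[core] += 1
        if was_miss:
            self.misses[core] += 1

    def enqueue(self, core, request):
        """Queue an LLC access request issued by a core."""
        self.queues[core].append(request)

    def priority(self, core):
        # Assumed heuristic: cores with a higher observed miss rate
        # are treated as more bandwidth-sensitive.
        if self.accesses[core] == 0:
            return 0.0
        return self.misses[core] / self.accesses[core]

    def grant(self):
        """Service the oldest request of the highest-priority core,
        or return None if no requests are pending."""
        ready = [c for c, q in enumerate(self.queues) if q]
        if not ready:
            return None
        winner = max(ready, key=self.priority)
        return self.queues[winner].popleft()
```

A real arbiter would also need starvation protection and periodic decay of the counters so that priorities track phase changes in each core's behavior; those mechanisms are omitted here for brevity.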

