Abstract
Shared last-level caches (LLCs) in multicore systems-on-chip face significant contention over limited bandwidth, and the resulting performance bottlenecks make this contention a first-order concern in modern multiprocessor systems-on-chip. Although shared cache space partitioning has been studied extensively, the problem of cache bandwidth partitioning has not received sufficient attention. We demonstrate that such contention occurs and show its impact on overall system performance. To address the issue, we perform detailed simulations to study the impact of different parameters and propose a novel cache bandwidth partitioning technique, called REAL, that arbitrates among cache access requests originating from different processor cores. REAL monitors LLC access patterns to dynamically assign a priority value to each core. Experimental results on different benchmark mixes show up to 2.13× overall system speedup over baseline policies, with minimal impact on energy.
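To make the idea of priority-based request arbitration concrete, the following is a minimal Python sketch of an arbiter that keeps one request queue per core, grants one LLC access per cycle, and periodically recomputes a per-core priority from observed accesses. It is not REAL's actual policy: the abstract does not say how priorities are derived, so the rule used here (cores granted fewer accesses in the last window get higher priority) is purely an assumption for illustration, as are the class name `PriorityArbiter`, the `window` parameter, and the one-grant-per-cycle model.

```python
from collections import deque


class PriorityArbiter:
    """Toy LLC request arbiter: one queue per core, one grant per cycle.

    Hypothetical illustration only; the priority rule is an assumption,
    not the rule used by the technique described in the abstract.
    """

    def __init__(self, num_cores, window=4):
        self.queues = [deque() for _ in range(num_cores)]
        self.recent = [0] * num_cores    # grants per core in the current window
        self.priority = [0] * num_cores  # higher value wins arbitration
        self.window = window             # cycles between priority updates
        self.cycle = 0

    def enqueue(self, core, addr):
        """Add a pending LLC access request from the given core."""
        self.queues[core].append(addr)

    def _update_priorities(self):
        # Assumed rule: the fewer grants a core received in the last
        # window, the higher its priority in the next one.
        self.priority = [-count for count in self.recent]
        self.recent = [0] * len(self.recent)

    def grant(self):
        """Grant one pending request; return (core, addr), or None if idle."""
        if self.cycle and self.cycle % self.window == 0:
            self._update_priorities()
        self.cycle += 1
        pending = [c for c, q in enumerate(self.queues) if q]
        if not pending:
            return None
        # Highest priority wins; ties go to the lowest core id.
        core = max(pending, key=lambda c: (self.priority[c], -c))
        self.recent[core] += 1
        return core, self.queues[core].popleft()
```

With two cores and `window=2`, core 0 wins the first two grants on the tie-break, after which its priority drops and core 1 is served. This shows how monitoring access patterns can shift bandwidth between cores over time.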
Index Terms
REAL: REquest Arbitration in Last Level Caches