ABSTRACT
Two factors primarily affect the performance of multi-threaded tasks on many-core processors with a shared and physically distributed Last-Level Cache (LLC): the power budget that guarantees thermally safe operation under a given task mapping, and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget but also increases the associated LLC latency. Conversely, mapping more threads to cores near the center of the many-core decreases the LLC latency but also decreases the power budget. Consequently, the two metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto-optimization problem that has not previously been exploited. We are the first to present a run-time task mapping algorithm, called PCMap, that exploits this trade-off. Our approach reduces the average task response time by up to 8.6% and the energy consumption by up to 8.5% compared to the state-of-the-art.
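The trade-off described in the abstract can be illustrated with a minimal sketch. This is not PCMap: it assumes a hypothetical n x n mesh with one LLC bank per core, uses the average Manhattan distance from a core to all banks as a stand-in for LLC latency, and uses the mean pairwise distance between the mapped cores as a crude stand-in for the power budget (a real budget would come from a thermal model). All function names are illustrative.

```python
from itertools import combinations


def manhattan(a, b):
    """Hop distance between two mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def avg_llc_distance(core, cores):
    """Average hop distance from one core to every LLC bank.

    Assumption: one bank per core, accessed uniformly.
    """
    return sum(manhattan(core, bank) for bank in cores) / len(cores)


def mapping_metrics(mapping, cores):
    """Return (latency proxy, spread proxy) for a set of mapped cores."""
    latency = sum(avg_llc_distance(c, cores) for c in mapping) / len(mapping)
    pairs = list(combinations(mapping, 2))
    # Assumption: larger spread stands in for a larger thermally safe budget.
    spread = sum(manhattan(a, b) for a, b in pairs) / len(pairs)
    return latency, spread


def pareto_front(points):
    """Keep non-dominated points: lower latency and higher spread are better."""
    front = []
    for i, (lat_i, spr_i) in enumerate(points):
        dominated = any(
            lat_j <= lat_i and spr_j >= spr_i and (lat_j < lat_i or spr_j > spr_i)
            for j, (lat_j, spr_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((lat_i, spr_i))
    return sorted(set(front))


def explore(n=4, threads=2):
    """Enumerate every mapping of `threads` threads on an n x n mesh."""
    cores = [(x, y) for x in range(n) for y in range(n)]
    points = [mapping_metrics(m, cores) for m in combinations(cores, threads)]
    return pareto_front(points)


if __name__ == "__main__":
    for latency, spread in explore():
        print(f"latency proxy = {latency:.2f}, spread proxy = {spread:.2f}")
```

Under these assumptions, two threads on a 4 x 4 mesh give a front of five points, from (2.0, 2.0) with both threads in the center to (3.0, 6.0) with the threads in opposite corners. No single mapping is best on both proxies, which mirrors the trade-off the abstract describes. Exhaustive enumeration is only feasible for toy sizes; it is used here purely to make the front visible.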
Index Terms
Pareto-Optimal Power- and Cache-Aware Task Mapping for Many-Cores with Distributed Shared Last-Level Cache