Abstract
Reuse Distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, however, is that multicore RD analysis requires measuring Concurrent Reuse Distance (CRD) and Private-LRU-stack Reuse Distance (PRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, so a profile measured for one processor configuration cannot be used directly to analyze another. For loop-based parallel programs, however, CRD and PRD profiles shift coherently across RD values with core count scaling because the interleaving threads are symmetric. Simple techniques can predict such shifting, making it feasible to analyze numerous multicore configurations from a small set of CRD and PRD profiles. Given the ubiquity of parallel loops, such techniques will be extremely valuable for studying future large multicore designs.
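As a rough illustration of the two profile types, the sketch below measures reuse distances on a single shared LRU stack (CRD) and on one private LRU stack per thread (PRD) for a tiny interleaved stream. It is a simplified sketch, not the article's tooling: the names `LRUStack`, `crd_distances`, and `prd_distances` are hypothetical, the stream is made up, and the private stacks ignore coherence effects such as write invalidations.

```python
class LRUStack:
    """LRU stack of block addresses; index 0 is the most recently used block."""

    def __init__(self):
        self._blocks = []

    def access(self, block):
        """Return the reuse distance of this access (None on a first touch)."""
        if block in self._blocks:
            # Depth in the stack = number of distinct blocks since the last use.
            depth = self._blocks.index(block)
            self._blocks.pop(depth)
        else:
            depth = None
        self._blocks.insert(0, block)
        return depth


def crd_distances(stream):
    """CRD: all threads' references go through one shared LRU stack."""
    shared = LRUStack()
    return [shared.access(block) for _tid, block in stream]


def prd_distances(stream):
    """PRD (simplified): each thread has its own LRU stack; no coherence modeled."""
    private = {}
    distances = []
    for tid, block in stream:
        stack = private.setdefault(tid, LRUStack())
        distances.append(stack.access(block))
    return distances


# Hypothetical stream of (thread id, block) pairs: two symmetric threads,
# each looping over its own two blocks, interleaved one reference at a time.
stream = [(0, "A"), (1, "C"), (0, "B"), (1, "D")] * 2

print(crd_distances(stream))  # [None, None, None, None, 3, 3, 3, 3]
print(prd_distances(stream))  # [None, None, None, None, 1, 1, 1, 1]
```

In this toy stream every reuse sits at distance 1 on the private stacks but at distance 3 on the shared stack, because the other thread's references are interleaved between each reuse. This is the kind of interleaving sensitivity the abstract describes.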
This article investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide an in-depth analysis of how CRD and PRD profiles change with core count scaling. Second, we develop techniques to predict CRD and PRD profile scaling, in particular employing reference groups [Zhong et al. 2003] to predict coherent shift, demonstrating 90% or greater prediction accuracy. Third, our CRD and PRD profile analyses define two application parameters with architectural implications: C_core is the minimum shared cache capacity that "contains" locality degradation due to core count scaling, and C_share is the capacity at which shared caches begin to provide a cache-miss reduction compared to private caches. Fourth, we apply CRD and PRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict shared LLC MPKI to within 10.7% of simulation across 1,728 configurations using only 36 measured CRD profiles, and private L2 cache MPKI to within 13.9% of simulation across 1,440 configurations using only 36 measured PRD profiles.
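To make the link between a reuse distance profile and MPKI concrete, here is a minimal sketch under the standard idealization of a fully associative LRU cache, where a reference misses exactly when it is a first touch or its reuse distance is at least the cache capacity in blocks. The function names, the toy distance list, and the instruction count are assumptions for illustration; the article's profile scaling prediction is not modeled here.

```python
from collections import Counter


def build_profile(distances):
    """Histogram of reuse distances; None counts first-touch (cold) references."""
    return Counter(distances)


def misses_at_capacity(profile, capacity_blocks):
    """Misses in a fully associative LRU cache holding `capacity_blocks` blocks."""
    return sum(
        count
        for rd, count in profile.items()
        if rd is None or rd >= capacity_blocks
    )


def mpki(profile, capacity_blocks, instructions):
    """Misses per thousand instructions for the given capacity."""
    return 1000.0 * misses_at_capacity(profile, capacity_blocks) / instructions


# Hypothetical profile: four cold references and four reuses at distance 3,
# from a run assumed to have executed 2,000 instructions.
profile = build_profile([None, None, None, None, 3, 3, 3, 3])

print(mpki(profile, 4, 2000))  # 2.0 (only the cold references miss)
print(mpki(profile, 2, 2000))  # 4.0 (the distance-3 reuses miss as well)
```

Because one profile yields the miss count at every capacity, a single measurement covers a whole sweep of cache sizes. That is why predicting how the profile itself shifts with core count is the key step.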
References
- Agarwal, A., Bao, L., Brown, J., Edwards, B., Mattina, M., Miao, C.-C., Ramey, C., and Wentzlaff, D. 2007. Tile processor: Embedded multicore for networking and multimedia. In Proceedings of the Symposium on High Performance Chips (Hot Chips).
- Barrow-Williams, N., Fensch, C., and Moore, S. 2009. A communication characterisation of Splash-2 and Parsec. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE Computer Society, 86--97.
- Berg, E., Zeffer, H., and Hagersten, E. 2006. A statistical multiprocessor cache model. In Proceedings of the International Symposium on Performance Analysis of Systems and Software. IEEE Computer Society, 89--99.
- Bienia, C., Kumar, S., and Li, K. 2008a. PARSEC vs. SPLASH2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE Computer Society, 47--56.
- Bienia, C., Kumar, S., Singh, J. P., and Li, K. 2008b. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 72--81.
- Binkert, N., Dreslinski, R., Hsu, L., Lim, K., Saidi, A., and Reinhardt, S. 2006. The M5 simulator: Modeling networked systems. IEEE Micro 26, 4, 52--60.
- Chandra, D., Guo, F., Kim, S., and Solihin, Y. 2005. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 340--351.
- Davis, J., Laudon, J., and Olukotun, K. 2005. Maximizing CMP throughput with mediocre cores. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 51--62.
- Ding, C. and Chilimbi, T. 2009. A composable model for analyzing locality of multi-threaded programs. Tech. rep. MSR-TR-2009-107, Microsoft Research.
- Ding, C. and Zhong, Y. 2003. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 245--257.
- Eyerman, S., Eeckhout, L., and Karkhanis, T. 2009. A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27, 2, 3:1--3:37.
- Hardavellas, N., Ferdman, M., Falsafi, B., and Ailamaki, A. 2009. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th International Symposium on Computer Architecture. ACM, 184--195.
- Hoskote, Y., Vangal, S., Dighe, S., Borkar, N., and Borkar, S. 2007. Teraflop prototype processor with 80 cores. In Proceedings of the Symposium on High Performance Chips (Hot Chips).
- Hsu, L., Iyer, R., Makineni, S., Reinhardt, S., and Newell, D. 2005. Exploring the cache design space for large scale CMPs. SIGARCH Comput. Archit. News 33, 4, 24--33.
- Huh, J., Burger, D., and Keckler, S. W. 2001. Exploring the design space of future CMPs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 199--210.
- Huh, J., Kim, C., Shafi, H., Zhang, L., Burger, D., and Keckler, S. W. 2005. A NUCA substrate for flexible CMP cache sharing. In Proceedings of the 19th International Conference on Supercomputing. ACM, 31--40.
- Jaleel, A., Theobald, K. B., Steely, Jr., S. C., and Emer, J. 2010. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th International Symposium on Computer Architecture. ACM, 60--71.
- Jiang, Y., Zhang, E. Z., Tian, K., and Shen, X. 2010. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction. Springer, 264--282.
- Karkhanis, T. S. and Smith, J. E. 2004. A first-order superscalar processor model. In Proceedings of the 31st International Symposium on Computer Architecture. ACM, 338--349.
- Li, J. and Martinez, J. F. 2005. Power-performance implications of thread-level parallelism on chip multiprocessors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE Computer Society, 124--134.
- Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M., and Jouppi, N. P. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd International Symposium on Microarchitecture. ACM, 469--480.
- Li, Y., Lee, B., Brooks, D., Hu, Z., and Skadron, K. 2006. CMP design space exploration subject to physical constraints. In Proceedings of the International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 17--28.
- Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 190--200.
- Mattson, R. L., Gecsei, J., Slutz, D. R., and Traiger, I. L. 1970. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2, 78--117.
- McCurdy, C. and Fischer, C. 2005. Using Pin as a memory reference generator for multiprocessor simulation. SIGARCH Comput. Archit. News 33, 5, 39--44.
- Narayanan, R., Ozisikyilmaz, B., Zambreno, J., Memik, G., and Choudhary, A. 2006. MineBench: A benchmark suite for data mining workloads. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE Computer Society, 182--188.
- Nayfeh, B. A. and Olukotun, K. 1994. Exploring the design space for a shared-cache multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture. 166--175.
- Qasem, A. and Kennedy, K. 2005. Evaluating a model for cache conflict miss prediction. Tech. rep. CS-TR05-457, Rice University.
- Rogers, B., Krishna, A., Bell, G., Vu, K., Jiang, X., and Solihin, Y. 2009. Scaling the bandwidth wall: Challenges in and avenues for CMP scaling. In Proceedings of the 36th International Symposium on Computer Architecture. ACM, 371--382.
- Schuff, D. L., Parsons, B. S., and Pai, V. S. 2009. Multicore-aware reuse distance analysis. Tech. rep. TR-ECE-09-08, Purdue University.
- Schuff, D. L., Kulkarni, M., and Pai, V. S. 2010. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 53--64.
- Suh, G. E., Devadas, S., and Rudolph, L. 2001. Analytical cache models with applications to cache partitioning. In Proceedings of the 15th International Conference on Supercomputing. ACM, 1--12.
- Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture. ACM, 24--36.
- Wu, M.-J. and Yeung, D. 2011. Coherent profiles: Enabling efficient reuse distance analysis of multicore scaling for loop-based parallel programs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 264--275.
- Xiang, X., Bao, B., Ding, C., and Gao, Y. 2011. Linear-time modeling of program working set in shared cache. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 350--360.
- Zhang, M. and Asanovic, K. 2005. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proceedings of the 32nd International Symposium on Computer Architecture. IEEE Computer Society, 336--345.
- Zhao, L., Iyer, R., Makineni, S., Moses, J., Illikkal, R., and Newell, D. 2007. Performance, area and bandwidth implications on large-scale CMP cache design. In Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnects.
- Zhong, Y. and Chang, W. 2008. Sampling-based program locality approximation. In Proceedings of the 7th International Symposium on Memory Management. ACM, 91--100.
- Zhong, Y., Dropsho, S. G., and Ding, C. 2003. Miss rate prediction across all program inputs. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 79--90.
- Zhong, Y., Shen, X., and Ding, C. 2009. Program locality analysis using reuse distance. ACM Trans. Program. Lang. Syst. 31, 6, 20:1--20:39.
Index Terms
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs