research-article

Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs

Published: 01 February 2013

Abstract

Reuse Distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, however, is that multicore RD analysis requires measuring Concurrent Reuse Distance (CRD) and Private-LRU-stack Reuse Distance (PRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD and PRD profiles shift coherently across RD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD and PRD profiles feasible. Given the ubiquity of parallel loops, such techniques will be extremely valuable for studying future large multicore designs.
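The difference between the two profile types can be illustrated with a small sketch. It is not the article's tooling: the function names, the round-robin interleaving, and the toy traces are assumptions made for illustration, and the PRD part ignores write invalidation between private stacks. CRD is measured on one shared LRU stack fed by the interleaved stream, while PRD is measured on one LRU stack per thread.

```python
from collections import Counter
from math import inf


def reuse_distances(stream):
    """Yield the LRU stack distance of each reference in `stream`.

    The distance is the number of distinct addresses touched since the
    previous reference to the same address (inf for a first touch).
    """
    stack = []  # most recently used address sits at the end
    for addr in stream:
        if addr in stack:
            pos = stack.index(addr)
            yield len(stack) - 1 - pos
            stack.pop(pos)
        else:
            yield inf
        stack.append(addr)


def round_robin(threads):
    """Interleave per-thread traces one reference at a time.

    Assumption: real interleavings depend on timing; round-robin is only
    the simplest symmetric choice.
    """
    merged = []
    for step in range(max(len(t) for t in threads)):
        for trace in threads:
            if step < len(trace):
                merged.append(trace[step])
    return merged


def crd_profile(threads):
    """Histogram of distances on a single shared LRU stack."""
    return Counter(reuse_distances(round_robin(threads)))


def prd_profile(threads):
    """Histogram of distances on per-thread private LRU stacks
    (simplified: no invalidation on writes)."""
    profile = Counter()
    for trace in threads:
        profile.update(reuse_distances(trace))
    return profile


# Two symmetric threads working on disjoint data (hypothetical traces).
threads = [["a", "b", "a"], ["c", "d", "c"]]
print(crd_profile(threads))  # reuses at distance 3 on the shared stack
print(prd_profile(threads))  # reuses at distance 1 on the private stacks
```

In this toy case each thread reuses its data at distance 1 in isolation, but interleaving two symmetric threads stretches the same reuses to distance 3 on the shared stack, which is the kind of shift with core count that the abstract describes.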

This article investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, and makes four contributions. First, we provide an in-depth analysis of how CRD and PRD profiles change with core count scaling. Second, we develop techniques to predict CRD and PRD profile scaling, in particular employing reference groups [Zhong et al. 2003] to predict coherent shift, demonstrating 90% or greater prediction accuracy. Third, our CRD and PRD profile analyses define two application parameters with architectural implications: C_core is the minimum shared cache capacity that “contains” locality degradation due to core count scaling, and C_share is the capacity at which shared caches begin to provide a cache-miss reduction compared to private caches. Fourth, we apply CRD and PRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict shared LLC MPKI (private L2 cache MPKI) to within 10.7% (13.9%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles.
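As a rough illustration of how a reuse distance profile translates into a miss metric such as MPKI, the sketch below applies the classic stack-distance property: in a fully associative LRU cache holding C blocks, a reference misses exactly when its reuse distance is at least C. The function names, the histogram, and the instruction count are hypothetical, and the sketch does not model set associativity or the prediction techniques of the article.

```python
from math import inf


def misses_at_capacity(profile, capacity_blocks):
    """Count references whose reuse distance is >= the cache capacity.

    `profile` maps reuse distance (in blocks) to reference count; first
    touches are recorded at distance inf and therefore always miss.
    """
    return sum(count for dist, count in profile.items()
               if dist >= capacity_blocks)


def mpki(profile, capacity_blocks, instructions):
    """Misses per thousand instructions at the given capacity."""
    return 1000.0 * misses_at_capacity(profile, capacity_blocks) / instructions


# Hypothetical profile: 1,000 references observed over 4,000 instructions.
profile = {0: 500, 4: 300, 64: 150, inf: 50}

for capacity in (1, 8, 128):
    print(capacity, misses_at_capacity(profile, capacity),
          mpki(profile, capacity, 4000))
```

Because one profile answers the question for every capacity at once, a small set of measured or predicted profiles can stand in for many separate cache simulations.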

References

  1. Agarwal, A., Bao, L., Brown, J., Edwards, B., Mattina, M., Miao, C.-C., Ramey, C., and Wentzlaff, D. 2007. Tile processor: Embedded multicore for networking and multimedia. In Proceedings of the Symposium on High Performance Chips (Hot Chips).
  2. Barrow-Williams, N., Fensch, C., and Moore, S. 2009. A communication characterisation of splash-2 and parsec. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE Computer Society, 86--97.
  3. Berg, E., Zeffer, H., and Hagersten, E. 2006. A statistical multiprocessor cache model. In Proceedings of the International Symposium on Performance Analysis of Systems and Software. IEEE Computer Society, 89--99.
  4. Bienia, C., Kumar, S., and Li, K. 2008a. PARSEC vs. SPLASH2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE Computer Society, 47--56.
  5. Bienia, C., Kumar, S., Singh, J. P., and Li, K. 2008b. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 72--81.
  6. Binkert, N., Dreslinski, R., Hsu, L., Lim, K., Saidi, A., and Reinhardt, S. 2006. The M5 simulator: Modeling networked systems. IEEE Micro 26, 4, 52--60.
  7. Chandra, D., Guo, F., Kim, S., and Solihin, Y. 2005. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 340--351.
  8. Davis, J., Laudon, J., and Olukotun, K. 2005. Maximizing CMP throughput with mediocre cores. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 51--62.
  9. Ding, C. and Chilimbi, T. 2009. A composable model for analyzing locality of multi-threaded programs. Tech. rep. MSR-TR-2009-107, Microsoft Research.
  10. Ding, C. and Zhong, Y. 2003. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 245--257.
  11. Eyerman, S., Eeckhout, L., and Karkhanis, T. 2009. A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27, 2, 3:1--3:37.
  12. Hardavellas, N., Ferdman, M., Falsafi, B., and Ailamaki, A. 2009. Reactive NUCA: Near-Optimal block placement and replication in distributed caches. In Proceedings of the 36th International Symposium on Computer Architecture. ACM, 184--195.
  13. Hoskote, Y., Vangal, S., Dighe, S., Borkar, N., and Borkar, S. 2007. Teraflop prototype processor with 80 Cores. In Proceedings of the Symposium on High Performance Chips (Hot Chips).
  14. Hsu, L., Iyer, R., Makineni, S., Reinhardt, S., and Newell, D. 2005. Exploring the cache design space for large scale CMPs. SIGARCH Comput. Archit. News 33, 4, 24--33.
  15. Huh, J., Burger, D., and Keckler, S. W. 2001. Exploring the design space of future CMPs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 199--210.
  16. Huh, J., Kim, C., Shafi, H., Zhang, L., Burger, D., and Keckler, S. W. 2005. A NUCA substrate for flexible CMP cache sharing. In Proceedings of the 19th International Conference on Supercomputing. ACM, 31--40.
  17. Jaleel, A., Theobald, K. B., Steely, Jr., S. C., and Emer, J. 2010. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th International Symposium on Computer Architecture. ACM, 60--71.
  18. Jiang, Y., Zhang, E. Z., Tian, K., and Shen, X. 2010. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction. Springer, 264--282.
  19. Karkhanis, T. S. and Smith, J. E. 2004. A first-order superscalar processor model. In Proceedings of the 31st International Symposium on Computer Architecture. ACM, 338--349.
  20. Li, J. and Martinez, J. F. 2005. Power-Performance implications of thread-level parallelism on chip multiprocessors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE Computer Society, 124--134.
  21. Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M., and Jouppi, N. P. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd International Symposium on Microarchitecture. ACM, 469--480.
  22. Li, Y., Lee, B., Brooks, D., Hu, Z., and Skadron, K. 2006. CMP design space exploration subject to physical constraints. In Proceedings of the International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 17--28.
  23. Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 190--200.
  24. Mattson, R. L., Gecsei, J., Slutz, D. R., and Traiger, I. L. 1970. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2, 78--117.
  25. McCurdy, C. and Fischer, C. 2005. Using pin as a memory reference generator for multiprocessor simulation. SIGARCH Comput. Archit. News 33, 5, 39--44.
  26. Narayanan, R., Ozisikyilmaz, B., Zambreno, J., Memik, G., and Choudhary, A. 2006. MineBench: A benchmark suite for data mining workloads. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE Computer Society, 182--188.
  27. Nayfeh, B. A. and Olukotun, K. 1994. Exploring the design space for a shared-cache multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture. 166--175.
  28. Qasem, A. and Kennedy, K. 2005. Evaluating a model for cache conflict miss prediction. Tech. rep. CS-TR05-457, Rice University.
  29. Rogers, B., Krishna, A., Bell, G., Vu, K., Jiang, X., and Solihin, Y. 2009. Scaling the bandwidth wall: Challenges in and avenues for CMP scaling. In Proceedings of the 36th International Symposium on Computer Architecture. ACM, 371--382.
  30. Schuff, D. L., Parsons, B. S., and Pai, V. S. 2009. Multicore-Aware reuse distance analysis. Tech. rep. TR-ECE-09-08, Purdue University.
  31. Schuff, D. L., Kulkarni, M., and Pai, V. S. 2010. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 53--64.
  32. Suh, G. E., Devadas, S., and Rudolph, L. 2001. Analytical cache models with applications to cache partitioning. In Proceedings of the 15th International Conference on Supercomputing. ACM, 1--12.
  33. Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A. 1995. The SPLASH-2 Programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture. ACM, 24--36.
  34. Wu, M.-J. and Yeung, D. 2011. Coherent profiles: Enabling efficient reuse distance analysis of multicore scaling for loop-based parallel programs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 264--275.
  35. Xiang, X., Bao, B., Ding, C., and Gao, Y. 2011. Linear-time modeling of program working set in shared cache. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 350--360.
  36. Zhang, M. and Asanovic, K. 2005. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proceedings of the 32nd International Symposium on Computer Architecture. IEEE Computer Society, 336--345.
  37. Zhao, L., Iyer, R., Makineni, S., Moses, J., Illikkal, R., and Newell, D. 2007. Performance, area and bandwidth implications on large-scale CMP cache design. In Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnects.
  38. Zhong, Y. and Chang, W. 2008. Sampling-based program locality approximation. In Proceedings of the 7th International Symposium on Memory Management. ACM, 91--100.
  39. Zhong, Y., Dropsho, S. G., and Ding, C. 2003. Miss rate prediction across all program inputs. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 79--90.
  40. Zhong, Y., Shen, X., and Ding, C. 2009. Program locality analysis using reuse distance. ACM Trans. Program. Lang. Syst. 31, 6, 20:1--20:39.


Published in

ACM Transactions on Computer Systems, Volume 31, Issue 1
February 2013
96 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/2427631

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 February 2013
• Accepted: 1 November 2012
• Revised: 1 October 2012
• Received: 1 May 2012


Qualifiers

• research-article
• Research
• Refereed
