research-article

Using Multicore Reuse Distance to Study Coherence Directories

Published: 28 July 2017

Abstract

Researchers have proposed numerous techniques to improve the scalability of coherence directories. The effectiveness of these techniques depends not only on application behavior but also on the CPU's configuration, for example, its core count and cache size. As CPUs continue to scale, it is essential to explore the directory's application and architecture dependencies. However, this is challenging given the slow speed of simulators. While it is common practice to simulate different applications, previous research on directory designs has explored only a few CPU configurations (in most cases, only one), which can lead to an incomplete and inaccurate view of the directory's behavior.

This article proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel least recently used (LRU) stacks, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling. A key part of our framework is the notion of relative reuse distance between sharers, which defines sharing in a capacity-dependent fashion and facilitates our analyses along the data cache size dimension.

We implement our framework in a profiler and then apply it to gain insights into the impact of multicore CPU scaling on directory behavior. Our profiling results show that directory accesses decrease by 3.3× when scaling the data cache size from 16 KB to 1 MB, despite an increase in sharing-based directory accesses. We also show that increased sharing caused by data cache scaling allows the portion of on-chip memory occupied by the directory to be reduced by 43.3%, compared to a reduction of only 2.6% when scaling the number of cores. We further show that certain directory entries exhibit high temporal reuse. In addition to gaining insights, we validate our profile-based results and find that they are within 2-10% of cache simulations on average across different validation experiments. Finally, we conduct four case studies that illustrate our insights on existing directory techniques. In particular, we demonstrate our directory occupancy insights on a Cuckoo directory; we apply our sharing insights to provide bounds on the size of Scalable Coherence Directories (SCD) and Dual-Grain Directories (DGD); and we demonstrate our directory entry reuse insights on a multilevel directory design.
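
For readers unfamiliar with reuse distance analysis, the following is a minimal sketch of the classic single-stack LRU formulation that the abstract builds on. It is not the article's framework: it models one fully associative LRU stack rather than parallel per-core stacks, and it ignores coherence and sharing. The function names `reuse_distances` and `misses` and the sample trace are assumptions made for illustration. The sketch shows why one profiling pass can answer questions for many cache sizes: an access hits in a cache of a given capacity exactly when its reuse distance is smaller than that capacity.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for block in trace:
        if block in stack:
            # Distance = number of distinct blocks touched since the
            # previous access to this block (linear scan, for clarity).
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(block))
            stack.move_to_end(block)  # block becomes most recently used
        else:
            distances.append(None)  # cold access: treated as infinite distance
            stack[block] = True
    return distances


def misses(distances, capacity_blocks):
    """Count misses in a fully associative LRU cache holding capacity_blocks blocks."""
    return sum(1 for d in distances if d is None or d >= capacity_blocks)


# Hypothetical trace of cache-block identifiers.
trace = ["A", "B", "C", "A", "B", "B"]
dists = reuse_distances(trace)
for capacity in (1, 2, 4):
    print(capacity, misses(dists, capacity))
```

The same list of distances is reused for every capacity, so sweeping the cache size does not require replaying the trace. Under the simplifying assumption that each private-cache miss triggers a directory lookup, such miss counts give a rough picture of how directory access volume could change with cache size.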

Published in

ACM Transactions on Computer Systems, Volume 35, Issue 2
May 2017
113 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/3129286

Copyright © 2017 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 28 July 2017
• Accepted: 1 April 2017
• Revised: 1 September 2016
• Received: 1 May 2015

        Qualifiers

        • research-article
        • Research
        • Refereed
