Abstract
Researchers have proposed numerous techniques to improve the scalability of coherence directories. The effectiveness of these techniques depends not only on application behavior but also on the CPU's configuration, for example, its core count and cache size. As CPUs continue to scale, it is essential to explore the directory's application and architecture dependencies. However, this is challenging given the slow speed of simulators. While it is common practice to simulate different applications, previous research on directory designs has explored only a few CPU configurations (in most cases, only one), which can lead to an incomplete and inaccurate view of the directory's behavior.
This article proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel least recently used (LRU) stacks, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling. A key part of our framework is the notion of relative reuse distance between sharers, which defines sharing in a capacity-dependent fashion and facilitates our analyses along the data cache size dimension.
We implement our framework in a profiler and then apply it to gain insights into the impact of multicore CPU scaling on directory behavior. Our profiling results show that directory accesses decrease by 3.3× when scaling the data cache size from 16KB to 1MB, despite an increase in sharing-based directory accesses. We also show that increased sharing caused by data cache scaling allows the portion of on-chip memory occupied by the directory to be reduced by 43.3%, compared to a reduction of only 2.6% when scaling the number of cores. In addition, we show that certain directory entries exhibit high temporal reuse. Beyond gaining insights, we validate our profile-based results and find that they are within 2% to 10% of cache simulations on average across different validation experiments. Finally, we conduct four case studies that illustrate our insights on existing directory techniques. In particular, we demonstrate our directory occupancy insights on a Cuckoo directory; we apply our sharing insights to provide bounds on the size of Scalable Coherence Directories (SCD) and Dual-Grain Directories (DGD); and we demonstrate our directory entry reuse insights on a multilevel directory design.
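The reuse distance analysis that the framework builds on can be illustrated with a minimal single-thread sketch. It is not the article's profiler: it models one LRU stack only, whereas the framework described above works on parallel LRU stacks and on relative reuse distance between sharers, neither of which is shown here. The function names `reuse_distances` and `misses` and the block labels in the trace are hypothetical, chosen for illustration.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the LRU stack distance of each access (None on first touch).

    The distance is the number of distinct blocks referenced since the
    previous access to the same block.
    """
    stack = OrderedDict()  # most recently used block is the last key
    distances = []
    for block in trace:
        if block in stack:
            keys = list(stack)
            # Depth of the block measured from the top (MRU end) of the stack.
            distances.append(len(keys) - 1 - keys.index(block))
            del stack[block]
        else:
            distances.append(None)
        stack[block] = True  # move (or push) the block to the MRU position
    return distances


def misses(distances, capacity):
    """Count misses in a fully associative LRU cache holding `capacity` blocks.

    An access misses if it is a first touch or its distance is >= capacity.
    """
    return sum(1 for d in distances if d is None or d >= capacity)


# Hypothetical trace of cache-block identifiers.
trace = ["A", "B", "C", "A", "B", "B", "D", "A"]
dists = reuse_distances(trace)
```

Because the distances are computed once and then compared against any capacity, a single profiling pass yields miss counts for every cache size, for example `misses(dists, 2)` and `misses(dists, 3)`. This property is what makes reuse distance profiles attractive for studying cache size scaling without rerunning a simulation per configuration.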
Using Multicore Reuse Distance to Study Coherence Directories