Abstract
On modern multi-core processors, independent workloads often interfere with each other by competing for shared cache space. In multi-threaded workloads, however, a single copy of data can be accessed by multiple threads, so the threads can share the cache cooperatively. Because data sharing consolidates the collective working set of the threads, the effective size of the shared cache becomes larger than it would be if the data were not shared. This paper presents a new theory of data sharing. It includes (1) a new metric, called the shared footprint, that mathematically computes the amount of data shared by any group of threads in a cache of any size, and (2) a linear-time algorithm that measures the shared footprint by scanning the memory trace of a multi-threaded program. The paper presents a practical implementation and evaluates the new theory using 14 PARSEC and SPEC OMP benchmarks, including an example use of shared footprint in program optimization.
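To make the idea of measuring shared data from a memory trace concrete, the following is a minimal sketch and not the paper's algorithm. It assumes, for illustration only, that the quantity of interest is the average number of distinct addresses touched by at least a given number of threads, taken over every window of a fixed number of consecutive accesses. The function name `shared_footprint`, the `(thread_id, address)` trace format, and the sample trace are all hypothetical. The sketch re-scans each window directly, so it is quadratic in the worst case, unlike the linear-time algorithm the abstract describes.

```python
from collections import defaultdict


def shared_footprint(trace, window, min_sharers):
    """Average number of distinct addresses accessed by at least
    `min_sharers` threads, over all windows of `window` consecutive
    accesses in `trace` (a list of (thread_id, address) pairs).

    Brute-force illustration of the definition assumed above;
    this is not the paper's linear-time measurement algorithm.
    """
    if window <= 0 or window > len(trace):
        raise ValueError("window must be between 1 and len(trace)")

    num_windows = len(trace) - window + 1
    total = 0
    for start in range(num_windows):
        # Map each address to the set of threads that touch it in this window.
        sharers = defaultdict(set)
        for thread_id, address in trace[start:start + window]:
            sharers[address].add(thread_id)
        # Count addresses with enough distinct sharers.
        total += sum(1 for threads in sharers.values()
                     if len(threads) >= min_sharers)
    return total / num_windows


# Hypothetical interleaved trace of two threads: (thread id, address).
trace = [(0, "a"), (1, "a"), (0, "b"), (1, "c"), (0, "a"), (1, "b")]

all_data = shared_footprint(trace, window=6, min_sharers=1)     # every address counts
shared_data = shared_footprint(trace, window=6, min_sharers=2)  # only addresses seen by both threads
```

With `min_sharers=1` the function reduces to an ordinary per-window count of distinct data, and raising `min_sharers` isolates the portion that is shared. The gap between the two values is one simple way to see how much sharing consolidates the combined working set.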
Thread Data Sharing in Cache: Theory and Measurement
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming