ABSTRACT
The main contribution of this paper is a compiler-based, cache-topology-aware code optimization scheme for emerging multicore systems. This scheme distributes the iterations of a loop to be executed in parallel across the cores of a target multicore machine and schedules the iterations assigned to each core. Our goal is to improve the utilization of the on-chip multi-layer cache hierarchy and to maximize overall application performance. We evaluate our cache-topology-aware approach using a set of twelve applications and three different commercial multicore machines. In addition, to study some of our experimental parameters in detail and to explore future multicore machines (with higher core counts and deeper on-chip cache hierarchies), we also conduct a simulation-based study. The results collected from our experiments with three Intel multicore machines show that the proposed compiler-based approach is very effective in enhancing performance. In addition, our simulation results indicate that optimizing for the on-chip cache hierarchy will be even more important in future multicores with increasing numbers of cores and cache levels.
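The abstract does not spell out the mapping algorithm itself. As a minimal illustrative sketch only (not the paper's actual method), the idea of topology-aware iteration distribution can be pictured as follows: given a hypothetical machine description listing which core IDs share each L2 cache, contiguous iteration blocks are handed to cores in topology order, so that cores under the same shared cache receive adjacent blocks and are more likely to reuse each other's cached data. The function name and the machine description format are assumptions made for this sketch.

```python
# Illustrative sketch only: the paper's mapping algorithm is not given in
# the abstract. `l2_groups` is a hypothetical machine description where
# each inner list holds the core IDs that share one L2 cache.

def map_iterations(num_iters, l2_groups):
    """Assign contiguous iteration blocks to cores in cache-topology order,
    so cores sharing an L2 receive adjacent blocks (a simple heuristic:
    adjacent iterations tend to touch nearby data, so co-locating them
    under one shared cache can improve on-chip reuse)."""
    cores = [c for group in l2_groups for c in group]  # topology order
    n_cores = len(cores)
    base, extra = divmod(num_iters, n_cores)  # balance the remainder
    mapping, start = {}, 0
    for i, core in enumerate(cores):
        size = base + (1 if i < extra else 0)
        mapping[core] = range(start, start + size)
        start += size
    return mapping

# Example: 4 cores, where cores (0,1) share one L2 and (2,3) share another.
# Cores 0 and 1 get iterations 0-49, cores 2 and 3 get iterations 50-99.
print(map_iterations(100, [[0, 1], [2, 3]]))
```

A real implementation would also have to account for the data-access pattern of the loop body and deeper cache levels, which is precisely the analysis a compiler-based scheme like the one described here automates.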