DOI: 10.1145/1806596.1806605
research-article

Cache topology aware computation mapping for multicores

Published: 05 June 2010

ABSTRACT

The main contribution of this paper is a compiler-based, cache-topology-aware code optimization scheme for emerging multicore systems. The scheme distributes the iterations of a loop to be executed in parallel across the cores of a target multicore machine and schedules the iterations assigned to each core. Our goal is to improve the utilization of the on-chip multi-layer cache hierarchy and to maximize overall application performance. We evaluate our cache-topology-aware approach using a set of twelve applications and three different commercial multicore machines. In addition, to study some of our experimental parameters in detail and to explore future multicore machines (with higher core counts and deeper on-chip cache hierarchies), we conduct a simulation-based study. The results collected from our experiments with three Intel multicore machines show that the proposed compiler-based approach is very effective in enhancing performance. Our simulation results further indicate that optimizing for the on-chip cache hierarchy will be even more important in future multicores with increasing numbers of cores and cache levels.
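To make the idea of distributing loop iterations with the cache topology in mind more concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical encoding of the cache hierarchy as a nested list of core IDs, and it assumes that neighbouring iterations tend to touch neighbouring data, so that giving adjacent iteration blocks to cores that share a cache may improve reuse. The names `Topology`, `cores_in_topology_order`, and `map_iterations` are illustrative only.

```python
from typing import Dict, List, Union

# Hypothetical topology encoding (an assumption, not from the paper):
# a nested list whose leaves are core IDs. Cores grouped in the same
# inner list share a cache at that level of the hierarchy.
Topology = Union[int, List["Topology"]]


def cores_in_topology_order(topology: Topology) -> List[int]:
    """Flatten the cache tree depth-first so cache-sharing cores are adjacent."""
    if isinstance(topology, int):
        return [topology]
    ordered: List[int] = []
    for child in topology:
        ordered.extend(cores_in_topology_order(child))
    return ordered


def map_iterations(num_iterations: int, topology: Topology) -> Dict[int, range]:
    """Give each core one contiguous block of iterations.

    Blocks are handed out in topology order, so cores that share a cache
    receive neighbouring blocks of the iteration space.
    """
    cores = cores_in_topology_order(topology)
    base, extra = divmod(num_iterations, len(cores))
    mapping: Dict[int, range] = {}
    start = 0
    for position, core in enumerate(cores):
        # The first `extra` cores take one additional iteration each.
        size = base + (1 if position < extra else 0)
        mapping[core] = range(start, start + size)
        start += size
    return mapping


if __name__ == "__main__":
    # Assumed machine: cores 0 and 2 share one L2, cores 1 and 3 share another.
    topology = [[0, 2], [1, 3]]
    for core, block in map_iterations(10, topology).items():
        print(f"core {core}: iterations {list(block)}")
```

A plain round-robin or ID-ordered block distribution would ignore which cores sit under the same cache. Here the only change is the order in which cores receive their blocks, which is the simplest place to inject topology information. Scheduling the iterations within each core, the second half of the scheme described in the abstract, is not modelled.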


Published in

PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2010
514 pages
ISBN: 9781450300193
DOI: 10.1145/1806596

ACM SIGPLAN Notices, Volume 45, Issue 6
PLDI '10
June 2010
496 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1809028

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 406 of 2,067 submissions, 20%
