DOI: 10.1145/1504176.1504208
research-article

A compiler-directed data prefetching scheme for chip multiprocessors

Published: 14 February 2009

ABSTRACT

Data prefetching has been widely used as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs; the experimental data collected using multi-threaded applications indicate that, while data prefetching improves performance with a small number of cores, its benefits diminish significantly as the number of cores increases, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors to degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for CMPs with a shared on-chip cache. The proposed scheme first identifies program phases using static compiler analysis, then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group. This reduces the total number of prefetches issued, the prefetch overheads, and the negative interactions on the shared cache space caused by data prefetches and, more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with applications from the SPEC OMP benchmark suite indicate that, on average, the proposed scheme improves overall parallel execution latency by 18.3% over the no-prefetch case and by 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently) when 12 cores are used. The corresponding average improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case).
We also demonstrate that the proposed scheme is robust across a wide range of values of our major simulation parameters, and that the improvements it achieves come very close to those achievable with an optimal scheme.
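
The grouping idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not say how threads are grouped, so the criterion used here (threads whose per-phase data footprints overlap are placed in the same group) is an assumption, as are all names (`group_threads`, `prefetch_plan`, `min_overlap`) and the toy block IDs. The sketch only shows why one helper thread per group may issue fewer prefetches than every thread prefetching independently.

```python
from typing import Dict, List, Set, Tuple

Footprints = Dict[int, Set[int]]  # thread id -> cache-block ids touched in one phase


def group_threads(footprints: Footprints, min_overlap: float = 0.5) -> List[List[int]]:
    """Greedily group threads by footprint similarity (an assumed criterion).

    A thread joins the first group whose combined footprint has a Jaccard
    similarity of at least `min_overlap` with its own; otherwise it starts
    a new group.
    """
    groups: List[List[int]] = []
    group_blocks: List[Set[int]] = []
    for tid in sorted(footprints):
        blocks = footprints[tid]
        for i, gb in enumerate(group_blocks):
            union = gb | blocks
            if union and len(gb & blocks) / len(union) >= min_overlap:
                groups[i].append(tid)
                group_blocks[i] = union
                break
        else:
            groups.append([tid])
            group_blocks.append(set(blocks))
    return groups


def prefetch_plan(phases: List[Footprints]) -> List[List[Tuple[List[int], List[int]]]]:
    """For each phase, give each group one helper that prefetches the group's union once."""
    plan = []
    for footprints in phases:
        phase_plan = []
        for group in group_threads(footprints):
            blocks = sorted(set().union(*(footprints[t] for t in group)))
            phase_plan.append((group, blocks))
        plan.append(phase_plan)
    return plan


def count_grouped(plan) -> int:
    """Prefetches issued when one helper per group fetches each block once."""
    return sum(len(blocks) for phase in plan for _, blocks in phase)


def count_independent(phases: List[Footprints]) -> int:
    """Prefetches issued when every thread prefetches its own footprint."""
    return sum(len(blocks) for fp in phases for blocks in fp.values())


# Toy input: two phases, four threads (hypothetical block ids).
phases = [
    {0: {1, 2, 3, 4}, 1: {2, 3, 4, 5}, 2: {10, 11, 12}, 3: {10, 11, 12, 13}},
    {0: {20, 21}, 1: {20, 21}, 2: {20, 21}, 3: {20, 21}},
]
plan = prefetch_plan(phases)
print(count_independent(phases), count_grouped(plan))
```

In this toy input the grouping changes between phases (two groups in the first phase, one in the second), which is the reason for regrouping per phase rather than once per program.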


Published in

PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2009, 322 pages
ISBN: 9781605583976
DOI: 10.1145/1504176

ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09)
April 2009, 294 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1594835

Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 230 of 1,014 submissions (23%)
