ABSTRACT
Data prefetching has been widely used as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs; the experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance with a small number of cores, its benefits diminish significantly as the number of cores increases, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors to degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for CMPs with a shared on-chip cache. The proposed scheme first identifies program phases using static compiler analysis, then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group. This helps to reduce the total number of prefetches issued, prefetch overheads, and negative interactions on the shared cache space due to data prefetches and, more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with applications from the SPEC OMP benchmark suite indicate that, when 12 cores are used, the proposed scheme improves overall parallel execution latency by 18.3% on average over the no-prefetch case and by 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently). The corresponding average improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and that the improvements it achieves come very close to those achievable with an optimal scheme.
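The grouping idea can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not say how threads are grouped, so the overlap criterion, the `min_overlap` threshold, and the names `group_threads` and `prefetch_counts` are all assumptions made for illustration. The sketch takes the set of cache blocks each thread touches in one phase, greedily groups threads whose block sets overlap, and compares the number of prefetches issued when every thread prefetches independently with the number issued when one helper thread per group prefetches the union of its group's blocks.

```python
from typing import Dict, List, Set, Tuple


def group_threads(access_sets: Dict[int, Set[int]],
                  min_overlap: float = 0.5) -> List[List[int]]:
    """Greedily group threads by shared cache blocks (hypothetical criterion).

    A thread joins the first existing group that already covers at least
    `min_overlap` of the thread's own blocks; otherwise it starts a new group.
    """
    groups: List[List[int]] = []
    group_blocks: List[Set[int]] = []
    for tid in sorted(access_sets):
        blocks = access_sets[tid]
        for i, covered in enumerate(group_blocks):
            if blocks and len(blocks & covered) / len(blocks) >= min_overlap:
                groups[i].append(tid)
                covered |= blocks  # the group's helper now covers these too
                break
        else:
            groups.append([tid])
            group_blocks.append(set(blocks))
    return groups


def prefetch_counts(access_sets: Dict[int, Set[int]],
                    groups: List[List[int]]) -> Tuple[int, int]:
    """Return (per-thread prefetches, per-group helper prefetches)."""
    per_thread = sum(len(blocks) for blocks in access_sets.values())
    per_group = sum(
        len(set().union(*(access_sets[tid] for tid in group)))
        for group in groups
    )
    return per_thread, per_group


# Made-up block IDs for four threads in a single program phase.
phase = {
    0: {0, 1, 2, 3},
    1: {2, 3, 4, 5},
    2: {100, 101},
    3: {100, 101, 102},
}
groups = group_threads(phase)
independent, grouped = prefetch_counts(phase, groups)
print(groups, independent, grouped)
```

On this made-up input the four threads fall into two groups, and the two helper threads issue 9 prefetches instead of the 13 issued when each thread prefetches its own data. A real compiler pass would derive the per-phase access sets from static analysis rather than take them as input.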
A compiler-directed data prefetching scheme for chip multiprocessors. PPoPP '09.