ABSTRACT
Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy applications and applications with large serial components. The challenge, then, is to develop techniques that allow multiple cores to work in concert to accelerate a single thread. This paper describes inter-core prefetching, one such technique. Inter-core prefetching extends existing work on helper threads for SMT machines to multicore machines.
Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that would otherwise be idle, prefetching the data that the compute thread will need. The compute thread then migrates between cores, following the path of the prefetching threads, and finds the data already waiting for it. Inter-core prefetching works with existing hardware and existing instruction set architectures. Using a range of state-of-the-art multiprocessors, this paper characterizes the potential benefits of the technique with microbenchmarks and then measures its impact on a range of memory-intensive applications. The results show that inter-core prefetching improves performance by an average of 31% to 63%, depending on the architecture, and speeds up some applications by as much as 2.8×. The results also show that inter-core prefetching reduces energy consumption by 11% to 26% on average.
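The interplay between the compute thread and a prefetching thread can be illustrated with a toy model. The sketch below is not the paper's implementation: it is a sequential Python simulation that assumes each core has a private LRU cache, that the work is split into fixed-size chunks, and that the prefetching thread always finishes touching the next chunk before the compute thread migrates to that core. The names `LRUCache` and `compute_thread_misses`, and all parameters, are hypothetical and chosen only for illustration. Timing, coherence traffic, and migration cost are ignored.

```python
from collections import OrderedDict


class LRUCache:
    """Toy model of one core's private cache (assumption: fully associative LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, addr):
        """Access addr. Return True on a hit, False on a miss (the line is then filled)."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
        return hit


def compute_thread_misses(addresses, chunk_size, num_cores, cache_lines, prefetch):
    """Count cache misses seen by the compute thread in this toy model.

    With prefetch=False the compute thread stays on core 0.
    With prefetch=True it runs chunk k on core k % num_cores, while a helper
    (modeled sequentially here) touches chunk k+1 on the next core beforehand.
    """
    caches = [LRUCache(cache_lines) for _ in range(num_cores)]
    chunks = [addresses[i:i + chunk_size]
              for i in range(0, len(addresses), chunk_size)]
    misses = 0
    for k, chunk in enumerate(chunks):
        core = k % num_cores if prefetch else 0
        if prefetch and k + 1 < len(chunks):
            # Prefetching thread warms the cache of the core the compute
            # thread will migrate to next.
            next_core = (k + 1) % num_cores
            for addr in chunks[k + 1]:
                caches[next_core].touch(addr)
        # Compute thread executes the current chunk on its current core.
        for addr in chunk:
            if not caches[core].touch(addr):
                misses += 1
    return misses


if __name__ == "__main__":
    addrs = list(range(1024))  # streaming access, every address is distinct
    print(compute_thread_misses(addrs, 64, 2, 128, prefetch=False))
    print(compute_thread_misses(addrs, 64, 2, 128, prefetch=True))
```

In this model the first chunk is never prefetched, so the compute thread still pays the cold-start misses for it. Every later chunk hits only because the chunk fits in the modeled cache and the helper is assumed to keep up, which a real system does not guarantee.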
Inter-core prefetching for multicore processors using migrating helper threads. ASPLOS '11.