DOI: 10.1145/1950365.1950411, ASPLOS Conference Proceedings
research-article

Inter-core prefetching for multicore processors using migrating helper threads

Published: 05 March 2011

ABSTRACT

Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy applications and applications with large serial components. The challenge, then, is to develop techniques that allow multiple cores to work in concert to accelerate a single thread. This paper describes inter-core prefetching, a technique to exploit multiple cores to accelerate a single thread. Inter-core prefetching extends existing work on helper threads for SMT machines to multicore machines.

Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that would otherwise be idle, prefetching the data that the compute thread will need. The compute thread then migrates between cores, following the path of the prefetch threads, and finds the data already waiting for it. Inter-core prefetching works with existing hardware and existing instruction set architectures. Using a range of state-of-the-art multiprocessors, this paper characterizes the potential benefits of the technique with microbenchmarks and then measures its impact on a range of memory-intensive applications. The results show that inter-core prefetching improves performance by an average of 31 to 63%, depending on the architecture, and speeds up some applications by as much as 2.8×. The paper also demonstrates that inter-core prefetching reduces energy consumption by between 11 and 26% on average.
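The migration pattern described above can be illustrated with a toy model. The sketch below is not the paper's implementation. It assumes that the access stream is cut into fixed-size chunks, that the compute thread runs chunk k on core k mod N, and that a single helper warms the next core's private cache with chunk k+1 in the meantime. The names `PrivateCache` and `run`, the LRU policy, and the chunking scheme are all hypothetical, and the model ignores migration cost, coherence traffic, and timing.

```python
from collections import OrderedDict


class PrivateCache:
    """Tiny LRU model of one core's private cache (capacity in lines)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Touch addr; return True on a hit, False on a miss (then fill)."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
        return hit


def run(addresses, chunk_size, num_cores, prefetch):
    """Return the number of misses seen by the compute thread.

    Assumed schedule (hypothetical): the compute thread runs chunk k on
    core k % num_cores. With prefetch enabled, a helper touches chunk k+1
    on the next core while chunk k is being computed, so the data is
    already resident when the compute thread migrates there.
    """
    if num_cores < 2:
        raise ValueError("the model needs at least two cores")
    caches = [PrivateCache(chunk_size) for _ in range(num_cores)]
    chunks = [addresses[i:i + chunk_size]
              for i in range(0, len(addresses), chunk_size)]
    misses = 0
    for k, chunk in enumerate(chunks):
        if prefetch and k + 1 < len(chunks):
            helper_core = (k + 1) % num_cores
            for addr in chunks[k + 1]:
                caches[helper_core].access(addr)  # helper warms next core
        compute_core = k % num_cores
        for addr in chunk:
            if not caches[compute_core].access(addr):
                misses += 1
    return misses
```

For a stream of 64 distinct addresses with `chunk_size=8` and `num_cores=2`, `run(list(range(64)), 8, 2, False)` returns 64 misses, while the same call with `prefetch=True` returns 8, because only the first chunk arrives at a cold cache. The model counts misses only; it says nothing about the speedups or energy savings reported in the abstract.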

