DOI: 10.1145/1168857.1168892 (ASPLOS Conference Proceedings)
Article

Stealth prefetching

Published: 20 October 2006

ABSTRACT

Prefetching in shared-memory multiprocessor systems is an increasingly difficult problem. As system designs grow to incorporate larger numbers of faster processors, memory latency and interconnect traffic increase. While aggressive prefetching techniques can mitigate the increasing memory latency, they can harm performance by wasting precious interconnect bandwidth and prematurely accessing shared data, causing state downgrades at remote nodes that force later upgrades.

This paper investigates Stealth Prefetching, a new technique that utilizes information from Coarse-Grain Coherence Tracking (CGCT) for prefetching data aggressively, stealthily, and efficiently in a broadcast-based shared-memory multiprocessor system. Stealth Prefetching utilizes CGCT to identify regions of memory that are not shared by other processors, aggressively fetches these lines from DRAM in open-page mode, and moves them close to the processor in anticipation of future references. Our analysis with commercial, scientific, and multiprogrammed workloads shows that Stealth Prefetching provides an average speedup of 20% over an aggressive baseline system with conventional prefetching.
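
The region-based decision described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the paper's hardware design: the 64-byte line size, the 1 KB region size, the `ToySystem` class, its per-processor prefetch buffers, and the rule that buffered lines are dropped once a second processor touches the region are all hypothetical choices made for illustration. DRAM open-page timing and the broadcast interconnect are not modeled.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64      # assumed cache-line size (not taken from the abstract)
REGION_BYTES = 1024  # assumed region size: 16 lines per region


def region_of(addr: int) -> int:
    """Return the index of the region that contains addr."""
    return addr // REGION_BYTES


def lines_in_region(region: int) -> list[int]:
    """Return the line-aligned addresses of every line in a region."""
    base = region * REGION_BYTES
    return [base + i * LINE_BYTES for i in range(REGION_BYTES // LINE_BYTES)]


@dataclass
class ToySystem:
    """Hypothetical stand-in for coarse-grain (region-level) tracking."""
    # region index -> ids of processors that have touched the region
    sharers: dict[int, set[int]] = field(default_factory=dict)
    # processor id -> line addresses held in its prefetch buffer
    prefetch_buffers: dict[int, set[int]] = field(default_factory=dict)

    def demand_miss(self, cpu: int, addr: int) -> list[int]:
        """Handle a demand miss; return the lines prefetched alongside it."""
        line = addr - addr % LINE_BYTES
        region = region_of(addr)
        others = self.sharers.get(region, set()) - {cpu}
        prefetched: list[int] = []
        if not others:
            # No other processor has touched the region: fetch the rest of
            # it and keep those lines close to the requesting processor.
            prefetched = [l for l in lines_in_region(region) if l != line]
            self.prefetch_buffers.setdefault(cpu, set()).update(prefetched)
        else:
            # Region is shared: fetch only the demand line, and (an
            # assumption of this sketch) drop other processors' buffered
            # lines from this region so no stale copies remain.
            for other in others:
                buf = self.prefetch_buffers.get(other, set())
                buf.difference_update(lines_in_region(region))
        self.sharers.setdefault(region, set()).add(cpu)
        return prefetched
```

In this model, the first processor to miss in an untouched region receives the region's other 15 lines, while a later miss by a different processor in the same region receives only its demand line and empties the first processor's buffered copies for that region.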

