DOI: 10.1145/1346281.1346290 (ASPLOS conference proceedings, research article)

Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors

Published: 01 March 2008

ABSTRACT

Integrating more processor cores on-die has become the unanimous trend in the microprocessor industry. Most current research uses chip multiprocessors (CMPs) as the baseline for analyzing problems in various domains. One of the main design issues facing CMP systems is the growing number of snoops required to maintain cache coherency and to support self/cross-modifying code, which leads to power and performance limitations. In this paper, we analyze the internal and external snoop behavior in a CMP system and, to save power, relax the snoopy cache coherence protocol based on program semantics and the properties of shared variables. Based on these observations and analyses, we propose two novel techniques, Selective Snoop Probe (SSP) and Essential Snoop Probe (ESP), to reduce power without compromising performance. Our simulation results show that the SSP technique achieves 5% to 65% data cache energy savings per core across different processor configurations, with a 1% to 2% performance improvement. We also show that 5% to 82% of data cache energy per core is spent on non-essential snoop probes, which can be saved using the ESP technique.
