ABSTRACT
Integrating more processor cores on-die has become the unanimous trend in the microprocessor industry. Most current research uses chip multiprocessors (CMPs) as the baseline to analyze problems in various domains. One of the main design issues facing CMP systems is the growing number of snoops required to maintain cache coherency and to support self/cross-modifying code, which leads to power and performance limitations. In this paper, we analyze the internal and external snoop behavior in a CMP system and, to save power, relax the snoopy cache coherence protocol based on program semantics and the properties of shared variables. Based on these observations and analyses, we propose two novel techniques, Selective Snoop Probe (SSP) and Essential Snoop Probe (ESP), to reduce power without compromising performance. Our simulation results show that the SSP technique achieves 5% to 65% data cache energy savings per core across different processor configurations, with a 1% to 2% performance improvement. We also show that 5% to 82% of data cache energy per core is spent on non-essential snoop probes, which can be saved using the ESP technique.
Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors
ASPLOS '08








