Abstract
Improvements in semiconductor nanotechnology have made chip multiprocessors (CMPs) the reference architecture for high-performance microprocessors. CMPs usually adopt private L1 caches and a large Last-Level Cache (LLC) shared among cores, whose performance depends on the wire-delay-dominated response time of the LLC. NUCA (Non-Uniform Cache Architecture) caches are a viable solution for tolerating wire-delay effects. In this article, we present Re-NUCA, a NUCA cache that replicates blocks inside the LLC to avoid the performance limitations that D-NUCA caches suffer from conflicting accesses to shared data. Results show that a Re-NUCA LLC improves performance by more than 5% on average, and by up to 15% for applications that suffer strongly from conflicting accesses to shared data, while reducing network traffic and power consumption with respect to D-NUCA caches. It also outperforms several S-NUCA schemes optimized with victim replication.
Index Terms
Exploiting replication to improve performances of NUCA-based CMP systems