Exploiting replication to improve performances of NUCA-based CMP systems

Published: 28 March 2014

Abstract

Improvements in semiconductor nanotechnology have made chip multiprocessors (CMPs) the reference architecture for high-performance microprocessors. CMPs usually adopt private L1 caches and large Last-Level Caches (LLCs) shared among cores, and their performance depends on the wire-delay-dominated response time of the LLC. NUCA (Non-Uniform Cache Architecture) caches are a viable solution for tolerating wire-delay effects. In this article, we present Re-NUCA, a NUCA cache that replicates blocks inside the LLC to avoid the performance limitations that D-NUCA caches suffer from conflicting accesses to shared data. Results show that a Re-NUCA LLC improves performance by more than 5% on average, and by up to 15% for applications that suffer heavily from conflicting accesses to shared data, while reducing network traffic and power consumption with respect to D-NUCA caches. It also outperforms several S-NUCA schemes optimized with victim replication.
