On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches

Published: 28 March 2014

Abstract

When hardware cache coherence scales to many cores on a chip, oversaturated traffic in the shared memory system may offset the benefit of massive hardware concurrency. In this article, we investigate the cost of a write-update protocol in terms of on-chip memory network traffic, and its adverse effects on system performance, on a multithreaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure. We find that, in the context of massive concurrency, adding a write-merging buffer with 0.46% area overhead to each core improves the performance of applications with good locality and concurrency by 18.74% on average. Other applications also benefit from this addition and achieve a throughput increase of 5.93%. This improvement further indicates that higher levels of concurrency per core can be exploited without degrading performance, so the architecture tolerates latency better and reaches higher processor efficiency than with the other solutions considered.
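To illustrate why merging writes can reduce coherence traffic under a write-update protocol, the following is a minimal sketch of a generic write-merging buffer. It is not the design evaluated in the article: the line size, buffer capacity, oldest-first eviction policy, and all names (`WriteMergingBuffer`, `count_messages`) are assumptions made for illustration only.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed, not taken from the article)


class WriteMergingBuffer:
    """Toy model: coalesces stores to one cache line into one update message."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # number of pending line entries (assumed)
        self.entries = OrderedDict()  # line address -> {byte offset: value}
        self.messages_sent = 0        # update messages placed on the network

    def store(self, addr, value):
        line = addr - addr % LINE_SIZE
        # A store to a new line evicts the oldest entry when the buffer is full.
        if line not in self.entries and len(self.entries) == self.capacity:
            self._evict_oldest()
        # Merge: a later store to a pending line joins the existing entry.
        self.entries.setdefault(line, {})[addr % LINE_SIZE] = value

    def _evict_oldest(self):
        # Each evicted entry costs one update message, however many stores it holds.
        self.entries.popitem(last=False)
        self.messages_sent += 1

    def flush(self):
        while self.entries:
            self._evict_oldest()


def count_messages(addresses, capacity=4):
    """Return the number of update messages produced by a sequence of byte stores."""
    buf = WriteMergingBuffer(capacity)
    for addr in addresses:
        buf.store(addr, 0xFF)
    buf.flush()
    return buf.messages_sent
```

In this toy model, `count_messages(range(128))` returns 2, because 128 consecutive byte stores fall into two 64-byte lines, whereas sending one update per store would produce 128 messages. A strided pattern that touches a new line on every store gains nothing from merging, which is in line with the abstract's observation that applications with good locality benefit most.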
