Abstract
When hardware cache coherence scales to many cores on a chip, oversaturated traffic in the shared memory system may offset the benefit of massive hardware concurrency. In this article, we investigate the cost of a write-update protocol in terms of on-chip memory network traffic and its adverse effects on system performance, based on a multithreaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure. We find that, in the context of massive concurrency, adding a write-merging buffer with 0.46% area overhead to each core improves the performance of applications with good locality and concurrency by 18.74% on average. Other applications also benefit from this addition and achieve a throughput increase of 5.93%. This improvement also indicates that higher levels of concurrency per core can be exploited without degrading performance, which tolerates latency better and yields higher processor efficiency than other solutions.
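To illustrate why write merging reduces update traffic, the following is a minimal sketch of a per-core buffer that coalesces stores to the same cache line into a single update message. It is a toy model and not the design evaluated in the article: the class name `WriteMergingBuffer`, the 64-byte line size, the four-entry capacity, and the FIFO eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class WriteMergingBuffer:
    """Toy model: coalesces stores to one cache line into one update message."""

    def __init__(self, line_size=64, capacity=4):
        self.line_size = line_size    # bytes per cache line (assumed value)
        self.capacity = capacity      # number of line entries (assumed value)
        self.entries = OrderedDict()  # line address -> {offset: value}
        self.messages = []            # update messages sent to the network

    def write(self, addr, value):
        line, offset = divmod(addr, self.line_size)
        if line not in self.entries:
            if len(self.entries) >= self.capacity:
                self._evict_oldest()  # make room; FIFO policy is an assumption
            self.entries[line] = {}
        # A later store to the same offset overwrites the earlier one.
        self.entries[line][offset] = value

    def _evict_oldest(self):
        # Emit one update message carrying every merged store for the line.
        line, data = self.entries.popitem(last=False)
        self.messages.append((line, data))

    def flush(self):
        while self.entries:
            self._evict_oldest()


def count_messages(addresses, line_size=64, capacity=4):
    """Return how many update messages a stream of stores generates."""
    buf = WriteMergingBuffer(line_size, capacity)
    for i, addr in enumerate(addresses):
        buf.write(addr, i)
    buf.flush()
    return len(buf.messages)


# 32 eight-byte stores that fall within four consecutive cache lines.
sequential = [8 * i for i in range(32)]
# 32 stores that each touch a different cache line.
strided = [64 * i for i in range(32)]

print(count_messages(sequential), count_messages(strided))
```

Without a buffer, each of the 32 stores would produce its own update message. In this model the sequential stream, which has good spatial locality, is merged into one message per line, while the strided stream gains nothing because no two stores share a line.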
Index Terms
On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches