
PAIS: Parallelism-aware interconnect scheduling in multicores

Published: 28 March 2014

Abstract

Multicore processors have the potential to deliver scalable performance by distributing computation across multiple cores. However, the communication cost of executing parallel application threads can significantly limit achievable performance, because packets from critical threads experience latency and contention on shared resources in the on-chip network. We present PAIS (Parallelism-Aware Interconnect Scheduling), which improves the performance and energy efficiency of parallel applications. PAIS dynamically detects thread execution progress based on communication latency and scheduling, and it accelerates communication for slowly executing threads by prioritizing their packets through flow control and priority-based arbitration.
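To illustrate the idea of priority-based arbitration in favor of slow threads, the following is a minimal sketch in Python. It is not the PAIS hardware mechanism, which the abstract does not detail. The names `Packet`, `arbitrate`, and the `progress` map are hypothetical, and the progress estimate is assumed to be a single number per thread where a lower value means a slower thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    thread_id: int  # thread that injected the packet (hypothetical field)
    arrival: int    # cycle at which the packet reached the router


def arbitrate(packets, progress):
    """Pick the packet to forward this cycle.

    `progress` maps thread_id -> an assumed progress estimate
    (higher means further ahead). The packet from the slowest
    thread wins; ties go to the oldest packet.
    """
    if not packets:
        return None
    return min(packets, key=lambda p: (progress[p.thread_id], p.arrival))


# Three packets compete for one output port.
packets = [Packet(0, 5), Packet(1, 7), Packet(2, 3)]
progress = {0: 0.9, 1: 0.4, 2: 0.4}
winner = arbitrate(packets, progress)
```

Here threads 1 and 2 are equally far behind, so the older packet from thread 2 is selected. A real router would also need a starvation guard for packets from fast threads, which this sketch omits.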
