Abstract
Multicore processors can deliver scalable performance by distributing computation across multiple cores. However, the communication cost of executing parallel application threads can significantly limit achievable performance, because packets from critical threads experience latency and contention on shared resources in the on-chip network. We present PAIS (Parallelism-Aware Interconnect Scheduling), which improves the performance and energy efficiency of parallel applications. PAIS dynamically detects thread execution progress based on communication latency and scheduling, and it accelerates communication for slowly executing threads by prioritizing their packets through flow control and priority-based arbitration.
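The idea of priority-based arbitration in favor of slow threads can be illustrated with a minimal sketch. This is not the PAIS implementation: the abstract does not say how progress is quantified or how priorities are encoded, so the per-thread progress score, the `Packet` fields, and the function names `rank_threads` and `arbitrate` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    thread_id: int       # thread that injected the packet (assumed field)
    arrival_cycle: int   # cycle at which the packet reached this router (assumed field)


def rank_threads(progress):
    """Map thread_id -> priority so that the slowest thread ranks highest.

    `progress` is a hypothetical score per thread (larger = further ahead).
    """
    ordered = sorted(progress, key=lambda t: progress[t])  # slowest first
    n = len(ordered)
    return {t: n - i for i, t in enumerate(ordered)}


def arbitrate(candidates, priority):
    """Pick one packet competing for an output port.

    The packet from the highest-priority (slowest) thread wins;
    ties are broken in favor of the oldest packet.
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda p: (priority.get(p.thread_id, 0), -p.arrival_cycle),
    )


# Example: thread 1 is furthest behind, so its packet wins arbitration
# even though it arrived last.
progress = {0: 0.9, 1: 0.4, 2: 0.7}
priority = rank_threads(progress)
winner = arbitrate([Packet(0, 5), Packet(1, 9), Packet(2, 3)], priority)
```

Breaking ties by age is one simple way to keep packets from the same thread in arrival order; a real router would also need a mechanism to keep low-priority packets from starving, which this sketch omits.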
Index Terms
PAIS: Parallelism-aware interconnect scheduling in multicores