Abstract
As technology strengthens its presence in all aspects of human life, computing systems integrate ever more processing cores, while applications become more complex and more demanding of computational resources. This sharp increase in processing elements, combined with the fact that the resource requirements of executed applications cannot be predicted at design time, imposes new design constraints on resource management for many-core systems and makes distributed operation a necessity. In this work, we present a distributed runtime resource management framework for many-core systems built on a network-on-chip (NoC) infrastructure. Specifically, we couple the concept of distributed management with parallel applications by assigning different roles to the available computing resources. The design is based on local controllers and managers, and an on-chip intercommunication scheme ensures that decisions are distributed. We evaluated the proposed framework on an Intel Single-Chip Cloud Computer, an actual NoC-based many-core system. Experimental results show that the proposed scheme allocates resources efficiently at runtime, leading to gains of up to 30% in application execution latency compared to relevant state-of-the-art distributed resource management frameworks.
A Hierarchical Distributed Runtime Resource Management Scheme for NoC-Based Many-Cores