Abstract
Many-core systems are envisioned to meet the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases, and NoC-based infrastructures are adopted to address interconnection scalability. The advent of these new architectures surfaces the need for more sophisticated, distributed resource management paradigms; combined with extreme integration scaling, this makes the new systems more prone to errors manifesting in both hardware and software. In this work, we highlight the need for run-time resource management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework that can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on the Intel Single-chip Cloud Computer (SCC), a NoC-based many-core system, and customize it to achieve minimum interference with the resource allocation process. We show that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault-tolerant frameworks shows a decrease of up to 67% in the overhead imposed on application execution.
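The idea of electing a replacement agent according to the status of each core can be illustrated with a minimal sketch. This is not SoftRM's actual election protocol, which the abstract does not specify; the `Core` record, its `busy` flag, and the `elect_replacement` function are hypothetical names introduced only for illustration. The sketch assumes every surviving agent sees the same core status, so a deterministic rule (prefer idle cores, break ties by lowest id) leads all of them to the same choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Core:
    """Hypothetical per-core status record (not from the paper)."""
    core_id: int
    alive: bool   # False once a permanent failure has been detected
    busy: bool    # True if the core is currently running application work


def elect_replacement(cores: List[Core]) -> Optional[Core]:
    """Pick a core to host the agent that replaces a failed one.

    Assumed policy: consider only live cores, prefer idle ones so that
    running applications are not disturbed, and break ties by the lowest
    core id so every agent applying the rule reaches the same result.
    """
    candidates = [c for c in cores if c.alive]
    if not candidates:
        return None
    # False sorts before True, so idle cores win over busy ones.
    return min(candidates, key=lambda c: (c.busy, c.core_id))
```

With one failed core, one busy core, and two idle cores, the rule selects the idle core with the lowest id; if every live core is busy, it falls back to the lowest-id busy core.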
Index Terms
SoftRM: Self-Organized Fault-Tolerant Resource Management for Failure Detection and Recovery in NoC Based Many-Cores