research-article

SoftRM: Self-Organized Fault-Tolerant Resource Management for Failure Detection and Recovery in NoC Based Many-Cores

Published: 27 September 2017

Abstract

Many-core systems are envisioned to satisfy the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases, and NoC based infrastructures are adopted to address interconnection scalability. The advent of these new architectures surfaces the need for more sophisticated, distributed resource management paradigms, which, in addition to the extreme integration scaling, make the new systems more prone to errors manifesting in both hardware and software. In this work, we highlight the need for Run-Time Resource Management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework that can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on the Intel Single-chip Cloud Computer (SCC), a NoC based many-core system, and customize it to achieve minimum interference with the resource allocation process. We show that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault-tolerant frameworks shows a decrease of up to 67% in the overhead imposed on application execution.
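
The workload-aware election described in the abstract can be illustrated with a minimal sketch. This is not the SoftRM protocol itself, which the abstract does not specify; it only assumes that every surviving agent sees the same core status table and that a free core is preferred as the replacement. The names `CoreStatus`, `running_tasks`, and `elect_replacement` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CoreStatus:
    """Hypothetical per-core status record (not taken from the paper)."""
    core_id: int
    alive: bool
    running_tasks: int  # 0 means the core is currently free


def elect_replacement(cores: List[CoreStatus], failed_id: int) -> Optional[int]:
    """Pick a core to take over the role of the failed agent.

    Assumption: all surviving agents evaluate the same status table, so a
    deterministic rule lets them agree without extra voting rounds.
    Rule: prefer the least-loaded live core, break ties by lowest core id.
    """
    candidates = [c for c in cores if c.alive and c.core_id != failed_id]
    if not candidates:
        return None  # no live core is left to take over
    best = min(candidates, key=lambda c: (c.running_tasks, c.core_id))
    return best.core_id


# Example status table: core 0 hosted the failed agent.
cores = [
    CoreStatus(0, alive=False, running_tasks=0),
    CoreStatus(1, alive=True, running_tasks=2),
    CoreStatus(2, alive=True, running_tasks=0),
    CoreStatus(3, alive=True, running_tasks=0),
]
print(elect_replacement(cores, failed_id=0))  # prints 2
```

Sorting on the pair (load, core id) is one simple way to make the choice both workload-aware and deterministic; a real framework would also need failure detection and a way to keep the status table consistent across agents, neither of which is modelled here.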

