
Declarative Resilience: A Holistic Soft-Error Resilient Multicore Architecture that Trades off Program Accuracy for Efficiency

Published: 24 July 2018

Abstract

To protect multicores from soft-error perturbations, research has explored various resiliency schemes that provide high soft-error coverage. However, these schemes incur high performance and energy overheads. We observe that not all soft-error perturbations affect program correctness; some soft errors affect only program accuracy, i.e., the program completes with certain acceptable deviations from the error-free outcome. Thus, it is practical to improve processor efficiency by trading off resiliency overheads against program accuracy. This article proposes the idea of declarative resilience, which selectively applies strong resiliency schemes to code regions that are crucial for program correctness (crucial code) and lightweight resiliency to code regions where soft errors can cause only deviations in program accuracy (non-crucial code). At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. A cross-layer architecture enables efficient resilience along with holistic soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution. For a set of machine-learning and graph analytic benchmarks, declarative resilience reduces the performance overhead relative to a state-of-the-art system that applies strong resiliency to all program code regions from ∼1.43× to ∼1.2×.
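The split between crucial and non-crucial code can be illustrated with a minimal software sketch. The abstract does not describe the concrete mechanisms, so everything here is an assumption made for illustration: redundant execution stands in for "strong resiliency", a range check stands in for "lightweight resiliency", and the names `run_crucial` and `run_non_crucial` are hypothetical rather than taken from the article.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_crucial(region: Callable[[], T], max_retries: int = 3) -> T:
    """Strong resiliency, assumed here to be redundant execution.

    The region is executed twice and its result is accepted only when
    both executions agree; otherwise the pair of executions is retried.
    """
    for _ in range(max_retries):
        first, second = region(), region()
        if first == second:
            return first
    raise RuntimeError("crucial region results kept disagreeing")


def run_non_crucial(region: Callable[[], float], lo: float, hi: float) -> float:
    """Lightweight resiliency, assumed here to be a simple range check.

    The region is executed once and its result is clamped into [lo, hi],
    so a perturbed value can cost accuracy but stays within a known range.
    """
    return min(max(region(), lo), hi)


if __name__ == "__main__":
    # Hypothetical crucial code: a loop bound that must be exactly right.
    bound = run_crucial(lambda: 2 + 2)
    # Hypothetical non-crucial code: a value expected to lie in [0, 1].
    weight = run_non_crucial(lambda: 0.5, 0.0, 1.0)
    print(bound, weight)
```

In this sketch the crucial path pays for at least two executions per region, while the non-crucial path pays for one execution plus a comparison, which mirrors the overhead-versus-accuracy tradeoff the abstract describes.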

