Abstract
To protect multicores from soft-error perturbations, prior research has explored various resiliency schemes that provide high soft-error coverage. However, these schemes incur high performance and energy overheads. We observe that not all soft-error perturbations affect program correctness; some affect only program accuracy, i.e., the program completes with acceptable deviations from the error-free outcome. Thus, it is practical to improve processor efficiency by trading program accuracy for lower resiliency overheads. This article proposes declarative resilience, which selectively applies strong resiliency schemes to code regions that are crucial for program correctness (crucial code) and lightweight resiliency to code regions where soft errors cause only deviations in program accuracy (non-crucial code). At the application level, code is classified as crucial or non-crucial based on its impact on the program outcome. A cross-layer architecture enables efficient resilience along with holistic soft-error coverage. In the worst case, a soft-error strike during non-crucial code execution compromises only program accuracy. For a set of machine-learning and graph-analytic benchmarks, declarative resilience reduces the performance overhead from ∼1.43×, incurred by a state-of-the-art system that applies strong resiliency to all code regions, to ∼1.2×.
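The split between crucial and non-crucial code can be illustrated with a small software-only sketch. It is not the article's cross-layer architecture. It assumes, purely for illustration, that "strong" resiliency means redundant execution with comparison and retry, and that "lightweight" resiliency means a cheap range check on the result. The names `run_crucial`, `run_non_crucial`, and `make_faulty` are hypothetical.

```python
import math


def run_crucial(region, max_attempts=3):
    """Strong resiliency (assumed here): execute the region twice and
    accept the result only when both executions agree; otherwise retry."""
    for _ in range(max_attempts):
        first = region()
        second = region()
        if first == second:
            return first
    raise RuntimeError("persistent mismatch in crucial region")


def run_non_crucial(region, low, high):
    """Lightweight resiliency (assumed here): execute once and clamp the
    result to a declared range, so a fault can cost accuracy but cannot
    push an out-of-range value into the rest of the program."""
    value = region()
    if math.isnan(value):
        return low
    return min(max(value, low), high)


def make_faulty(region, corrupt, strike_on_call):
    """Hypothetical fault injector: corrupts the result of one chosen call."""
    calls = 0

    def wrapped():
        nonlocal calls
        calls += 1
        result = region()
        return corrupt(result) if calls == strike_on_call else result

    return wrapped


# Crucial example: a loop bound. A bit flip on the first execution is
# detected by the comparison and masked by the retry.
loop_bound = run_crucial(make_faulty(lambda: 10, lambda v: v ^ 4, 1))

# Non-crucial example: a normalized weight. The corrupted value is only
# bounded, not corrected, so accuracy may deviate.
weight = run_non_crucial(make_faulty(lambda: 0.5, lambda v: v * 1e9, 1), 0.0, 1.0)
```

In this sketch the crucial region pays for at least two executions per call, while the non-crucial region pays only for one execution plus a bounds check, which mirrors the overhead-versus-accuracy tradeoff the abstract describes.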
Declarative Resilience: A Holistic Soft-Error Resilient Multicore Architecture that Trades off Program Accuracy for Efficiency