Abstract
IoT devices need reliable hardware at low cost, but it is challenging to cope efficiently with both hard and soft faults in embedded scratchpad memories. To address this problem, we propose a two-step approach: FaultLink and Software-Defined Error-Localizing Codes (SDELC). FaultLink avoids hard faults found during testing by generating a custom-tailored application binary image for each individual chip. At software deployment time, FaultLink optimally packs small sections of program code and data into fault-free segments of the memory address space and generates a custom linker script for a lazy-linking procedure. At run time, SDELC deals with unpredictable soft faults via novel and inexpensive Ultra-Lightweight Error-Localizing Codes (UL-ELCs). These require fewer parity bits than single-error-correcting Hamming codes, yet they are more powerful than basic single-error-detecting parity: they localize single-bit errors to a specific chunk of a codeword. SDELC then heuristically recovers from these localized errors using a small embedded C library that exploits observable side information (SI) about the application’s memory contents, such as redundant data (value locality) or the distinction between legal and illegal instructions. Our combined FaultLink+SDELC approach improves min-VDD by up to 440 mV and correctly recovers from up to 90% of random single-bit soft faults in data and up to 70% in instructions, with just three parity bits per 32-bit word.
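To make the error-localization idea concrete, the following is a minimal Python sketch of how three parity bits can localize a single-bit error in a 32-bit word to one of seven chunks, and how side information might then select a recovery candidate. It is an illustration only, not the paper's construction: the interleaved chunk assignment (`chunk_of`), the function names, and the Hamming-distance heuristic over neighboring words (a stand-in for value locality) are all assumptions made for this example.

```python
K = 32  # data bits per word
R = 3   # parity bits per word, so 2**R - 1 = 7 nonzero syndromes

def chunk_of(bit):
    """Assumed chunk assignment: data bit i gets the nonzero 3-bit label (i % 7) + 1."""
    return (bit % 7) + 1

def parity(word):
    """Compute the 3 parity bits as the XOR of the chunk labels of all set data bits."""
    p = 0
    for bit in range(K):
        if (word >> bit) & 1:
            p ^= chunk_of(bit)
    return p

def localize(word, stored_parity):
    """Return 0 if no error is detected, else the label of the chunk holding the error."""
    return parity(word) ^ stored_parity

def candidates(word, syndrome):
    """List every data word reachable by undoing one single-bit error in the chunk."""
    cands = [word ^ (1 << b) for b in range(K) if chunk_of(b) == syndrome]
    if syndrome in (1, 2, 4):
        # A flipped parity bit yields a one-hot syndrome, so the data may be intact.
        cands.append(word)
    return cands

def hamming(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")

def recover(word, stored_parity, neighbors):
    """Pick the candidate closest (in total Hamming distance) to neighboring words.

    This heuristic is an assumed stand-in for side information based on value locality.
    """
    syndrome = localize(word, stored_parity)
    if syndrome == 0:
        return word
    return min(candidates(word, syndrome),
               key=lambda c: sum(hamming(c, n) for n in neighbors))

# Usage: store a word with its parity, corrupt one data bit, then attempt recovery.
original = 0x00001000
stored = parity(original)
corrupted = original ^ (1 << 9)
neighbors = [0x00001000, 0x00001004, 0x00001008]
recovered = recover(corrupted, stored, neighbors)
```

Because the code only narrows the error to a chunk of four or five bits, recovery is a guess among a handful of candidates; it succeeds here because the neighboring words resemble the original value, and it can fail when the memory contents offer no such regularity.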
Index Terms
Low-Cost Memory Fault Tolerance for IoT Devices