Abstract
Scaling of semiconductor devices has enabled higher levels of integration and performance improvements, at the price of making devices more susceptible to the effects of static and dynamic variability. Adding safety margins (guardbands) to the operating frequency or supply voltage prevents timing errors but has a negative impact on performance and energy consumption. We propose Edge-TM, an adaptive hardware/software error management policy that (i) optimistically scales the voltage beyond the edge of safe operation for better energy savings and (ii) works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism. The policy applies dynamic voltage scaling (DVS), keeping the frequency fixed, based on the feedback provided by the HTM, which makes it simple and generally applicable. Experiments on an embedded platform show that our technique achieves a 57% energy improvement compared to using voltage guardbands and an extra 21-24% improvement over existing state-of-the-art error tolerance solutions, at a nominal area and time overhead.
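As a rough illustration of the feedback idea described in the abstract, the following sketch shows one way a voltage controller could react to HTM abort statistics. It is not the Edge-TM policy itself: the class name `VoltageController`, the voltage bounds, the step size, the error-rate threshold, and the notion of a fixed monitoring window are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class VoltageController:
    """Hypothetical DVS feedback controller; every constant here is an assumption."""
    v_min: float = 0.60           # assumed lowest voltage the regulator supports (V)
    v_max: float = 1.10           # assumed guardbanded (safe) voltage (V)
    step: float = 0.01            # assumed regulator step size (V)
    max_error_rate: float = 0.05  # assumed tolerated fraction of error-induced aborts
    voltage: float = 1.10         # start at the safe voltage; frequency stays fixed

    def update(self, transactions: int, error_aborts: int) -> float:
        """Adjust the voltage after one monitoring window and return the new value."""
        if transactions == 0:
            return self.voltage  # no feedback in this window, so hold the voltage
        error_rate = error_aborts / transactions
        if error_rate > self.max_error_rate:
            # Too many rollbacks: recovery cost likely outweighs the savings, back off.
            self.voltage = min(self.v_max, self.voltage + self.step)
        else:
            # Few or no errors: keep scaling optimistically toward the edge.
            self.voltage = max(self.v_min, self.voltage - self.step)
        self.voltage = round(self.voltage, 4)
        return self.voltage
```

In this sketch the HTM plays two roles: it rolls back transactions hit by timing errors, and its abort counts serve as the only input to the controller, which is what would keep such a policy simple.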
Edge-TM: Exploiting Transactional Memory for Error Tolerance and Energy Efficiency