Abstract
Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several cycles to several seconds or more. Due to practical limitations of circuit techniques, cost-effective reliability will likely require the ability to temporarily suspend execution on a core during periods of intermittent faults.
We investigate three of the most obvious techniques for adapting to the dynamically changing resource availability caused by intermittent faults and demonstrate their different system-level implications. We show that system software reconfiguration has very high overhead, that temporarily pausing execution on a faulty core can lead to cascading livelock, and that using spare cores has a high fault-free cost. To remedy these and other drawbacks of the three baseline techniques, we propose using a thin hardware/firmware layer to manage an overcommitted system: one where the OS is configured to use more virtual processors than the number of currently available physical cores. We show that this technique can gracefully degrade performance during intermittent faults of varying duration with low overhead, without involving system software, and without requiring spare cores.
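The contrast between pausing a faulty core and running an overcommitted system can be illustrated with a toy model. The sketch below is not the paper's hardware/firmware mechanism; it is a minimal round-robin simulation under stated assumptions. The function names (`run_pinned`, `run_overcommitted`), the fault oracle `is_faulty`, and the notion of a fixed scheduling quantum are all hypothetical and introduced only for illustration. It also ignores synchronization between threads, so it does not reproduce cascading livelock; it only shows how lost capacity is distributed.

```python
from collections import deque
from typing import Callable, List

# Hypothetical fault oracle type: (quantum index, core index) -> True if the
# core must be suspended during that quantum.
FaultOracle = Callable[[int, int], bool]


def run_pinned(num_vcpus: int, num_cores: int, num_quanta: int,
               is_faulty: FaultOracle) -> List[int]:
    """Baseline sketch: each virtual processor is pinned to one core and
    simply stalls whenever that core is suspended."""
    progress = [0] * num_vcpus
    for q in range(num_quanta):
        for v in range(num_vcpus):
            if not is_faulty(q, v % num_cores):
                progress[v] += 1
    return progress


def run_overcommitted(num_vcpus: int, num_cores: int, num_quanta: int,
                      is_faulty: FaultOracle) -> List[int]:
    """Overcommitted sketch: in every quantum, the currently healthy cores
    run the virtual processors at the front of a round-robin queue."""
    progress = [0] * num_vcpus
    ready = deque(range(num_vcpus))
    for q in range(num_quanta):
        healthy = sum(1 for c in range(num_cores) if not is_faulty(q, c))
        # Dispatch one waiting virtual processor per healthy core.
        running = [ready.popleft() for _ in range(min(healthy, len(ready)))]
        for v in running:
            progress[v] += 1
        # Rotate: the ones that just ran go to the back of the queue.
        ready.extend(running)
    return progress


def core0_always_faulty(quantum: int, core: int) -> bool:
    """Assumed fault pattern: core 0 is unavailable for the whole run."""
    return core == 0


pinned = run_pinned(4, 4, 8, core0_always_faulty)
shared = run_overcommitted(4, 4, 8, core0_always_faulty)
print("pinned:       ", pinned)
print("overcommitted:", shared)
```

In this toy run both policies complete the same total amount of work, but the pinned policy starves one virtual processor entirely while the overcommitted policy spreads the slowdown evenly across all four. An evenly spread slowdown is the kind of graceful degradation the abstract refers to, although the real system must also deal with effects this model leaves out.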
ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)