research-article

Adapting to intermittent faults in multicore systems

Published: 01 March 2008
Abstract

Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several cycles to several seconds or more. Due to practical limitations of circuit techniques, cost-effective reliability will likely require the ability to temporarily suspend execution on a core during periods of intermittent faults.

We investigate three of the most obvious techniques for adapting to the dynamically changing resource availability caused by intermittent faults, and demonstrate their different system-level implications. We show that system software reconfiguration has very high overhead, that temporarily pausing execution on a faulty core can lead to cascading livelock, and that using spare cores has high fault-free cost. To remedy these and other drawbacks of the three baseline techniques, we propose using a thin hardware/firmware layer to manage an overcommitted system -- one where the OS is configured to use more virtual processors than the number of currently available physical cores. We show that this proposed technique can gracefully degrade performance during intermittent faults of various duration with low overhead, without involving system software, and without requiring spare cores.
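
The idea of an overcommitted system can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the mechanism evaluated in the paper: the class `OvercommitLayer`, its method names, and the round-robin policy are all hypothetical, and time is modeled as abstract scheduling quanta. The layer multiplexes a fixed set of virtual processors onto whichever physical cores are currently healthy, so the OS-visible processor count never changes while a core is suspended.

```python
from collections import deque


class OvercommitLayer:
    """Toy model (hypothetical): round-robin multiplexing of virtual CPUs
    onto the physical cores that are currently fault-free."""

    def __init__(self, num_vcpus, num_cores):
        self.run_queue = deque(range(num_vcpus))  # vCPUs in rotation order
        self.healthy = set(range(num_cores))      # cores currently usable
        self.progress = [0] * num_vcpus           # quanta executed per vCPU

    def mark_faulty(self, core):
        # Suspend execution on a core for the duration of a fault burst.
        self.healthy.discard(core)

    def mark_healthy(self, core):
        # Return the core to service once the burst has subsided.
        self.healthy.add(core)

    def step(self):
        """Run one quantum: each healthy core gets the next waiting vCPU."""
        n = min(len(self.healthy), len(self.run_queue))
        scheduled = [self.run_queue.popleft() for _ in range(n)]
        for vcpu in scheduled:
            self.progress[vcpu] += 1
        self.run_queue.extend(scheduled)  # rotate to the back of the queue
        return dict(zip(sorted(self.healthy), scheduled))


layer = OvercommitLayer(num_vcpus=4, num_cores=4)
for _ in range(4):
    layer.step()          # fault-free: every vCPU runs every quantum
layer.mark_faulty(2)      # intermittent fault begins on core 2
for _ in range(4):
    layer.step()          # 4 vCPUs now share 3 healthy cores
layer.mark_healthy(2)     # fault subsides
print(layer.progress)     # [7, 7, 7, 7]
```

In this sketch every virtual processor loses one quantum out of four during the fault, so the slowdown is spread evenly instead of stalling a single virtual processor, and the software above the layer is never reconfigured.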

    Published in

      ACM SIGPLAN Notices, Volume 43, Issue 3
      ASPLOS '08
      March 2008
      339 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/1353536

      ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
      March 2008
      352 pages
      ISBN: 9781595939586
      DOI: 10.1145/1346281

      Copyright © 2008 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
