DOI: 10.1145/1508244.1508265. Venue: ASPLOS (ACM conference proceedings)
research-article

Mixed-mode multicore reliability

Published: 07 March 2009

ABSTRACT

Future processors are expected to experience increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage of faults from many different sources. This reliability, however, comes at a high price in terms of per-thread IPC and overall system throughput.
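As a rough software analogy for the redundant execution described above, the sketch below runs the same computation twice and compares the results before accepting them. It is only an illustration under stated assumptions: the names `run_redundantly`, `RedundancyMismatch`, `square`, and the `inject_fault` hook are hypothetical, and real DMR compares the outputs of two hardware cores rather than two Python calls.

```python
from typing import Callable, Optional


class RedundancyMismatch(Exception):
    """Raised when the two redundant executions disagree (hypothetical name)."""


def run_redundantly(
    work: Callable[[int], int],
    value: int,
    inject_fault: Optional[Callable[[int], int]] = None,
) -> int:
    """Execute `work` twice and accept the result only if both copies agree.

    `inject_fault` is an illustrative hook that corrupts the second copy's
    result, standing in for a hardware fault on one of the two cores.
    """
    first = work(value)
    second = work(value)
    if inject_fault is not None:
        second = inject_fault(second)
    if first != second:
        raise RedundancyMismatch(f"copies disagree: {first} != {second}")
    return first


def square(v: int) -> int:
    """Example workload for the demonstration below."""
    return v * v


if __name__ == "__main__":
    # Fault-free run: both copies agree and the result is accepted.
    print(run_redundantly(square, 7))
    # Faulty run: flipping the low bit of one copy's result is detected.
    try:
        run_redundantly(square, 7, inject_fault=lambda r: r ^ 1)
    except RedundancyMismatch as err:
        print("fault detected:", err)
```

The sketch also makes the cost visible: every unit of useful work is executed twice, which is the source of the throughput penalty the abstract mentions.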

We observe that a user may want to run both applications requiring high reliability, such as financial software, and more fault-tolerant applications requiring high performance, such as media or web software, on the same machine at the same time. Yet a traditional DMR system must operate fully in redundant mode whenever any application requires high reliability.

This paper proposes a Mixed-Mode Multicore (MMM), which enables most applications, including the system software, to run with high reliability in DMR mode, while applications that need high performance can avoid the penalty of DMR. Though conceptually simple, two key challenges arise: 1) care must be taken to protect reliable applications from any faults occurring in applications running in high-performance mode, and 2) the desire to execute additional independent software threads for a performance application complicates the scheduling of computation to cores. After solving these issues, an MMM is shown to improve overall system performance, compared to a traditional DMR system, by approximately 2X when one reliable and one performance application are concurrently executing.
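The core-allocation arithmetic behind this comparison can be sketched with a toy model. It assumes every thread contributes equally and ignores the per-thread IPC cost of DMR, so it does not reproduce the paper's measured results. The function names and the example core counts are hypothetical and chosen only for illustration.

```python
def performance_threads_dmr(cores: int, reliable_threads: int) -> int:
    """Performance threads available when every thread needs a core pair."""
    pairs = cores // 2
    if reliable_threads > pairs:
        raise ValueError("not enough core pairs for the reliable threads")
    return pairs - reliable_threads


def performance_threads_mixed(cores: int, reliable_threads: int) -> int:
    """Performance threads available when only reliable threads are paired."""
    reliable_cores = 2 * reliable_threads
    if reliable_cores > cores:
        raise ValueError("not enough cores for the reliable threads")
    return cores - reliable_cores


if __name__ == "__main__":
    cores, reliable = 8, 2  # illustrative values, not taken from the paper
    dmr = performance_threads_dmr(cores, reliable)
    mixed = performance_threads_mixed(cores, reliable)
    print(f"traditional DMR: {dmr} performance threads")
    print(f"mixed mode:      {mixed} performance threads")
```

In this model the reliable application keeps the same number of core pairs in both configurations, while the performance application runs one thread per leftover core instead of one per leftover pair.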

