ABSTRACT
Future processors are expected to experience increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage of many different sources of faults. This reliability, however, comes at a high price in terms of per-thread IPC and overall system throughput.
We observe that a user may want to run, on the same machine at the same time, both applications requiring high reliability, such as financial software, and more fault-tolerant applications requiring high performance, such as media or web software. Yet a traditional DMR system must operate fully in redundant mode whenever any application requires high reliability.
This paper proposes a Mixed-Mode Multicore (MMM), which enables most applications, including the system software, to run with high reliability in DMR mode, while applications that need high performance can avoid the penalty of DMR. Though the idea is conceptually simple, two key challenges arise: 1) care must be taken to protect reliable applications from any faults occurring in applications running in high-performance mode, and 2) the desire to execute additional independent software threads for a performance application complicates the scheduling of computation to cores. After solving these issues, an MMM is shown to improve overall system performance, compared to a traditional DMR system, by approximately 2X when one reliable and one performance application are concurrently executing.
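The core-allocation trade-off described above can be illustrated with a small counting sketch. This is a toy model written for this summary, not the paper's scheduler or evaluation method: the function name `thread_slots` and its parameters are hypothetical, and it assumes each DMR thread occupies exactly two cores while each performance-mode thread occupies one. It counts available thread slots only and says nothing about measured throughput.

```python
def thread_slots(total_cores: int, reliable_threads: int) -> dict:
    """Count thread slots left for a performance application.

    Toy assumptions (not taken from the paper):
      * a thread run in DMR mode occupies two cores;
      * a thread run in performance mode occupies one core;
      * the reliable application always runs in DMR mode.
    """
    if total_cores < 0 or reliable_threads < 0:
        raise ValueError("counts must be non-negative")
    if total_cores % 2 != 0:
        raise ValueError("this sketch assumes cores come in pairs")

    reliable_cores = 2 * reliable_threads
    if reliable_cores > total_cores:
        raise ValueError("not enough cores for the reliable threads")

    remaining = total_cores - reliable_cores
    return {
        # Traditional DMR: the performance application is also paired up.
        "full_dmr": remaining // 2,
        # Mixed mode: the performance application uses cores individually.
        "mixed_mode": remaining,
    }


if __name__ == "__main__":
    # Hypothetical 8-core chip with a 2-thread reliable application.
    print(thread_slots(8, 2))
```

Under these assumptions the performance application gets twice as many thread slots in mixed mode as under full DMR. Whether that translates into a proportional speedup depends on the workload and on how the two challenges listed above are handled, which this sketch does not model.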
Mixed-mode multicore reliability
ASPLOS 2009