Abstract
One of the adverse effects of shrinking transistor sizes is that processors have become increasingly prone to hardware faults. At the same time, the number of cores per die rises. Consequently, core failures can no longer be ruled out, and future operating systems for many-core machines will have to incorporate fault tolerance mechanisms.
We present CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems. Our approach overcomes surprise removal of faulty cores, and also tolerates cascading core failures. When a core fails in user mode, CSR terminates the process executing on that core and migrates the remaining processes in that core's run queue to other cores. We further show how hardware transactional memory may be used to overcome failures in critical kernel code. Our solution is scalable, incurs low overhead, and is designed to integrate into modern operating systems. We have implemented it in the Linux kernel, using Intel's Transactional Synchronization Extensions (TSX) on a Haswell processor, and tested it on a real system.
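The user-mode recovery step described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the `Core` class, the `handle_core_failure` function, and the least-loaded placement policy are all hypothetical names and choices introduced here for illustration; the abstract does not say how target cores are selected.

```python
from collections import deque


class Core:
    """Toy model of a core: one running process plus a run queue (hypothetical)."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.online = True
        self.current = None        # process executing on this core, if any
        self.run_queue = deque()   # runnable processes waiting on this core


def handle_core_failure(cores, failed_id):
    """Mark a core as failed, drop its running process, migrate the rest.

    Returns (killed_process, [(process, target_core_id), ...]).
    """
    failed = cores[failed_id]
    failed.online = False

    # The process that was executing on the failed core is terminated.
    killed = failed.current
    failed.current = None

    survivors = [c for c in cores.values() if c.online]
    if not survivors:
        raise RuntimeError("no online cores left")

    migrated = []
    while failed.run_queue:
        proc = failed.run_queue.popleft()
        # Assumed policy: pick the surviving core with the shortest run queue,
        # breaking ties by the lowest core id.
        target = min(survivors, key=lambda c: (len(c.run_queue), c.core_id))
        target.run_queue.append(proc)
        migrated.append((proc, target.core_id))
    return killed, migrated


if __name__ == "__main__":
    cores = {i: Core(i) for i in range(3)}
    cores[0].current = "A"
    cores[0].run_queue.extend(["B", "C"])
    cores[1].run_queue.append("D")

    print(handle_core_failure(cores, 0))  # first failure
    print(handle_core_failure(cores, 1))  # cascading failure of a second core
```

Calling the function again for a second core models a cascading failure: processes that were migrated once are simply migrated again to whichever cores remain online.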
CSR: Core Surprise Removal in Commodity Operating Systems
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems