CSR: Core Surprise Removal in Commodity Operating Systems

Published: 25 March 2016

Abstract

One of the adverse effects of shrinking transistor sizes is that processors have become increasingly prone to hardware faults. At the same time, the number of cores per die rises. Consequently, core failures can no longer be ruled out, and future operating systems for many-core machines will have to incorporate fault tolerance mechanisms.

We present CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems. Our approach overcomes surprise removal of faulty cores, and also tolerates cascading core failures. When a core fails in user mode, CSR terminates the process executing on that core and migrates the remaining processes in its run-queue to other cores. We further show how hardware transactional memory may be used to overcome failures in critical kernel code. Our solution is scalable, incurs low overhead, and is designed to integrate into modern operating systems. We have implemented it in the Linux kernel, using Haswell's Transactional Synchronization Extensions (TSX), and tested it on a real system.
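
The user-mode recovery step described above can be illustrated with a toy model. The following is only a sketch under stated assumptions, not the paper's Linux implementation: the `Core` class, the `remove_failed_core` function, and the least-loaded placement policy are all hypothetical names and choices introduced here for illustration. The abstract says only that the running process is terminated and the remaining run-queue is migrated to other cores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Core:
    """Toy model of a core: one running process plus a run-queue of waiting ones."""
    running: Optional[str] = None
    run_queue: List[str] = field(default_factory=list)
    alive: bool = True


def remove_failed_core(cores: Dict[int, Core], failed_id: int) -> Optional[str]:
    """Handle a user-mode failure of core `failed_id` in the toy model.

    Returns the name of the terminated process, or None if the core was idle.
    """
    failed = cores[failed_id]
    failed.alive = False

    # The process that was executing on the failed core is terminated.
    terminated = failed.running
    failed.running = None

    survivors = [c for c in cores.values() if c.alive]
    if not survivors:
        raise RuntimeError("no surviving core to migrate to")

    # The remaining processes move to surviving cores. Least-loaded placement
    # is an assumption of this sketch, not a policy stated in the abstract.
    for proc in failed.run_queue:
        target = min(survivors, key=lambda c: len(c.run_queue))
        target.run_queue.append(proc)
    failed.run_queue = []
    return terminated
```

Calling `remove_failed_core` repeatedly on different core IDs models cascading failures in this sketch: each call migrates whatever the failed core had queued, including processes inherited from an earlier failure, to the cores that are still alive.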

