skip to main content
research-article
Public Access

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Authors Info & Claims
Published:03 October 2018Publication History
Skip Abstract Section

Abstract

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

References

  1. 2011. NAND Flash Media Management Through RAIN. Micron.Google ScholarGoogle Scholar
  2. 2017. Open Hardware Monitor. Retrieved December 2017 from http://openhardwaremonitor.org.Google ScholarGoogle Scholar
  3. 2018. UCARE: Fail-Slow Database. Retrieved February 2018 from http://ucare.cs.uchicago.edu/projects/failslow/.Google ScholarGoogle Scholar
  4. Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. Correlated crash vulnerabilities. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. 2001. Fail-stutter fault tolerance. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS VIII). Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Mona Attariyan and Jason Flinn. 2010. Automating configuration troubleshooting with dynamic information flow analysis. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. 2007. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2008. An analysis of data corruption in the storage stack. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST’08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Robert C. Baumann. 2005. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability (TDMR) 5, 3 (September 2005).Google ScholarGoogle ScholarCross RefCross Ref
  10. Eric Brewer. 2016. Spinning disks and their cloudy future (keynote), In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST’16).Google ScholarGoogle Scholar
  11. Yu Cai, Yixin Luo, Saugata Ghose, and Onur Mutlu. 2015. Read disturb errors in MLC NAND flash memory: Characterization and mitigation. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu. 2015. Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. In Proceedings of the 15th International Symposium on High Performance Computer Architecture (HPCA-21).Google ScholarGoogle ScholarCross RefCross Ref
  13. George Candea and Armando Fox. 2003. Crash-only software. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX). Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Christine S. Chan, Boxiang Pan, Kenny Gross, Kenny Gross, and Tajana Simunic Rosing. 2013. Correcting vibration-induced performance degradation in enterprise servers. In Proceedings of the Greenmetrics Workshop (Greenmetrics’13).Google ScholarGoogle Scholar
  15. Allen Clement, Edmund L. Wong, Lorenzo Alvisi, Michael Dahlin, and Mirco Marchetti. 2009. Making Byzantine fault tolerant systems tolerate byzantine faults. In Proceedings of the 6th Symposium on Networked Systems Design and Implementation (NSDI’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Daniel J. Dean, Hiep Nguyen, Xiaohui Gu, Hui Zhang, Junghwan Rhee, Nipun Arora, and Geoff Jiang. 2014. PerfScope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC’14). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. 2013. Limplock: Understanding the impact of limpware on scale-out cloud systems. In Proceedings of the 4th ACM Symposium on Cloud Computing (SoCC’13). Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Thanh Do, Tyler Harter, Yingchao Liu, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2013. HARDFS: Hardening HDFS with selective and lightweight versioning. In Proceedings of the 11th USENIX Symposium on File and Storage Technologies (FAST’13). Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Nosayba El-Sayed, Ioan A. Stefanovici, George Amvrosiadis, Andy A. Hwang, and Bianca Schroeder. 2012. Temperature management in data centers: Why some (might) like it hot. In Proceedings of the 2012 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST’17). Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. 2014. What bugs live in the cloud? A study of 3000+ issues in cloud systems. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC’14). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar. 2016. Why does the cloud stop computing? Lessons from hundreds of service outages. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Mingzhe Hao, Huaicheng Li, Michael Hao Tong, Chrisma Pakha, Riza O. Suminto, Cesar A. Stuardo, Andrew A. Chien, and Haryadi S. Gunawi. 2017. MittOS: Supporting millisecond tail tolerance with fast rejecting SLO-aware OS interface. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP’17). Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien, and Haryadi S. Gunawi. 2016. The tail at store: A revelation from millions of hours of disk and SSD deployments. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Peng Huang, Chuanxiong Guo, Lindong Znhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randonph Yao. 2017. Gray failure: The Achilles’ heel of cloud scale systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS XVII). Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Asim Kadav, Matthew J. Renzelmann, and Michael M. Swift. 2009. Tolerating hardware device failures in software. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. 2010. Black-box problem diagnosis in parallel file systems. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Jaeho Kim, Donghee Lee, and Sam H. Noh. 2015. Towards SLO complying SSDs through OPS isolation. In Proceedings of the 13th USENIX Symposium on File and Storage Technologies (FAST’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. 2016. TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer, Surendar Chandra, and Windsor Hsu. 2015. RAIDShield: Characterizing, monitoring, and proactively protecting against disk failures. In Proceedings of the 13th USENIX Symposium on File and Storage Technologies (FAST’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. 2015. A large-scale study of flash memory failures in the field. In Proceedings of the 2015 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Thanumalayan Sankaranarayana Pillai, Ramnatthan Alagappan, Lanyue Lu, Vijay Chidambaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. Application crash consistency and performance with CCFS. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST’17). Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2005. IRON file system. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Bianca Schroeder, Sotirios Damouras, and Phillipa Gill. 2010. Understanding latent sector errors and how to protect against them. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Bianca Schroeder and Garth A. Gibson. 2007. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST’07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. 2016. Flash reliability in production: The expected and the unexpected. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. 2009. DRAM errors in the wild: A large-scale field study. In Proceedings of the 2009 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Brian D. Strom, SungChang Lee, George W. Tyndall, and Andrei Khurshudov. 2007. Hard disk drive reliability modeling and failure prediction. IEEE Transactions on Magnetics (TMAG) 43, 9 (September 2007).Google ScholarGoogle Scholar
  40. Riza O. Suminto, Cesar A. Stuardo, Alexandra Clark, Huan Ke, Tanakorn Leesatapornwongsa, Bo Fu, Daniar H. Kurniawan, Vincentius Martin, Uma Maheswara Rao G., and Haryadi S. Gunawi. 2017. PBSE: A robust path-based speculative execution for degraded-network tail tolerance in data-parallel frameworks. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Eitan Yaakobi, Laura Grupp, Paul H. Siegel, Steven Swanson, and Jack K. Wolf. 2012. Characterization and error-correcting codes for TLC flash memories. In Proceedings of the International Conference on Computing, Networking and Communications (ICNC’12).Google ScholarGoogle Scholar
  42. Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi. 2017. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST’17). Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Zuoning Yin, Xiao Ma, Jing Zheng, Yuanyuan Zhou, Lakshmi N. Bairavasundaram, and Shankar Pasupathy. 2011. An empirical study on configuration errors in commercial and open source systems. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Storage
        ACM Transactions on Storage  Volume 14, Issue 3
        Special Issue on FAST 2018 and Regular Papers
        August 2018
        210 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/3282875
        • Editor:
        • Sam H. Noh
        Issue’s Table of Contents

        Copyright © 2018 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 3 October 2018
        • Accepted: 1 July 2018
        • Received: 1 June 2018
        Published in tos Volume 14, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!