Abstract
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems