ABSTRACT
Modern-day datacenters host hundreds of thousands of servers that coordinate tasks to deliver highly available cloud computing services. These servers comprise multiple components (hard disks, memory modules, network cards, processors, etc.), each of which, while carefully engineered, is capable of failing. While the probability of any such failure occurring during the lifetime of a server (typically 3-5 years in industry) can be small, these probabilities are magnified across all the devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than the exception.
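The magnification of small per-server failure probabilities at fleet scale can be made concrete with a short calculation. The sketch below is illustrative only: the failure rate and fleet size are assumed numbers, not figures from this paper, and server failures are treated as independent for simplicity.

```python
# Illustration (assumed numbers, not from the paper): a small annual
# per-server hardware failure probability compounds across a fleet.

def prob_at_least_one_failure(p_single: float, n_servers: int) -> float:
    """Probability that at least one of n independent servers fails,
    given each fails with probability p_single over the period."""
    return 1.0 - (1.0 - p_single) ** n_servers

p = 0.02          # assumed: 2% chance a server fails in a given year
fleet = 100_000   # assumed: servers in a large datacenter

# With independence, the expected number of failing servers is p * n.
expected_failures = p * fleet
print(f"P(at least one failure) = {prob_at_least_one_failure(p, fleet):.6f}")
print(f"Expected failing servers per year = {expected_failures:.0f}")
```

Even with only a 2% annual per-server failure rate, a 100,000-server fleet expects on the order of 2,000 hardware failures per year, and the probability of seeing at least one is essentially 1, which is why repair processes must be treated as routine operations rather than exceptional events.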
Hardware failure can lead to degraded performance for end-users and can result in losses to the business. A sound understanding of the frequency and causes of these failures improves operational experience: it not only equips us to better tolerate failures but also helps bring down hardware costs through engineering, directly leading to savings for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis of failure predictors. We hope that the results presented in this paper will serve as motivation to foster further research in this area.
Characterizing cloud computing hardware reliability