DOI: 10.1145/1807128.1807161
research-article

Characterizing cloud computing hardware reliability

Published: 10 June 2010

ABSTRACT

Modern-day datacenters host hundreds of thousands of servers that coordinate tasks to deliver highly available cloud computing services. These servers consist of multiple hard disks, memory modules, network cards, processors, and other components, each of which, while carefully engineered, is capable of failing. While the probability of seeing any such failure during the lifetime of a server (typically 3-5 years in industry) can be somewhat small, these numbers are magnified across all the devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than the exception.
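The scale argument above can be illustrated with a short calculation. The following is a minimal sketch, not taken from the paper: it assumes failures are independent across servers, and the per-server failure probability and fleet size are hypothetical placeholders rather than measured values.

```python
import math


def prob_at_least_one_failure(p_single: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices fails.

    Computes 1 - (1 - p)^n, using log1p/expm1 so that very small p
    does not lose precision.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n_devices < 0:
        raise ValueError("n_devices must be non-negative")
    if p_single == 1.0:
        # log1p(-1.0) is undefined, so handle certain failure directly.
        return 1.0 if n_devices > 0 else 0.0
    return -math.expm1(n_devices * math.log1p(-p_single))


def expected_failures(p_single: float, n_devices: int) -> float:
    """Expected number of failed devices (linearity of expectation)."""
    return p_single * n_devices


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    p = 0.01        # assumed chance that one server fails in some period
    n = 100_000     # assumed number of servers in the datacenter
    print("P(at least one failure):", prob_at_least_one_failure(p, n))
    print("Expected failures:", expected_failures(p, n))
```

Even when the assumed per-server probability is small, the fleet-wide probability of seeing at least one failure approaches 1 as the number of servers grows, which is the sense in which failure becomes the norm at scale.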

Hardware failure can degrade performance for end users and can result in losses to the business. A sound understanding of the numbers, as well as the causes behind these failures, helps improve operational experience: it not only leaves us better equipped to tolerate failures but also helps bring down hardware cost through engineering, directly leading to savings for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis of failure predictors. We hope that the results presented in this paper will motivate further research in this area.

Published in

SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing
June 2010
264 pages
ISBN: 9781450300360
DOI: 10.1145/1807128

Copyright © 2010 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 169 of 722 submissions, 23%
