DOI: 10.1145/2503210.2503217
research-article

Optimization of cloud task processing with checkpoint-restart mechanism

Published: 17 November 2013

ABSTRACT

In this paper, we aim to optimize fault-tolerance techniques based on a checkpointing/restart mechanism in the context of cloud computing. Our contribution is three-fold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic, with no assumption about the failure probability distribution, but also simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing with regard to various costs, such as checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart (BLCR) tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is well suited to Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock times by 50-100 seconds per job on average.
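For context on the baseline named in the abstract, the following is a minimal sketch of Young's classical first-order approximation, which sets the checkpoint interval to sqrt(2 * C * MTBF), where C is the cost of one checkpoint and MTBF is the mean time between failures. The function names, parameter names, and the conversion from an interval to a checkpoint count are assumptions made for illustration. The paper's own formula is not stated in the abstract and is not reproduced here.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF).

    Both arguments must use the same time unit (e.g. seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoints_for_task(task_length: float, checkpoint_cost: float, mtbf: float) -> int:
    """Illustrative helper (an assumption, not from the paper): number of
    equally spaced checkpoints implied by Young's interval for one task."""
    interval = young_interval(checkpoint_cost, mtbf)
    # A task split into k intervals needs k - 1 checkpoints between them.
    return max(0, math.ceil(task_length / interval) - 1)


if __name__ == "__main__":
    # Hypothetical numbers: 2 s per checkpoint, 400 s MTBF, 200 s task.
    print(young_interval(2.0, 400.0))
    print(checkpoints_for_task(200.0, 2.0, 400.0))
```

Young's formula assumes a single mean failure rate, which is the kind of distributional assumption the abstract says the new analysis avoids.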

References

  1. M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, Feb 2009.
  2. J. E. Smith and R. Nair. Virtual Machines: Versatile Platforms For Systems And Processes. Morgan Kaufmann, 2005.
  3. D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat. Enforcing performance isolation across virtual machines in Xen. In Proceedings of 7th ACM/IFIP/USENIX Int'l Conf. on Middleware (Middleware'06), pages 342--362, 2006.
  4. C. Evangelinos and C. N. Hill. Cloud Computing for parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2. In Computability and Complexity in Analysis (CAA'08), 2008.
  5. Amazon Elastic Compute Cloud. Online at http://aws.amazon.com/ec2/.
  6. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. Eucalyptus: an open-source cloud computing infrastructure. In Journal of Physics: Conference Series, 180(1):1--14, 2009.
  7. J. N. Glosli, K. J. Caspersen, J. A. Gunnels, D. F. Richards, R. E. Rudd, and F. H. Streitz. Extending Stability Beyond CPU Millennium: A Micron-Scale Atomistic Simulation of Kelvin-Helmholtz Instability. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC'07), pages 58:1--58:11, 2007.
  8. J. Wilkes. More Google cluster data. Google research blog, Nov. 2011, posted at http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html.
  9. C. Reiss, J. Wilkes, and J. L. Hellerstein. Google cluster-usage traces: format + schema. Google Inc., Mountain View, CA, USA, Technical Report, Nov. 2011, revised 2012.03.20.
  10. C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis. Intel Science and Technology Center for Cloud Computing, Carnegie Mellon University, Pittsburgh, PA, USA, Tech. Rep. ISTC-CC-TR-12-101, Apr. 2012.
  11. S. Di, D. Kondo, and W. Cirne. Characterization and comparison of cloud versus grid workloads. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster'12), pages 230--238, 2012.
  12. S. Yi, A. Andrzejak, and D. Kondo. Monetary Cost-Aware Checkpointing and Migration on Amazon Cloud Spot Instances. In IEEE Trans. on Services Computing, 5(4):512--524, 2012.
  13. W. Cirne, G. Chaudhry, and S. Johnson. Managing Descheduling Risk in the Google Cloud. Posted at http://cloud.berkeley.edu/data/managing-descheduling-risk-in-the-google-cloud-berkeley.pdf.
  14. J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. In Future Generation Computer Systems, 22(3):303--312, 2006.
  15. R. Subramaniyan, E. Grobelny, S. Studham, and A. George. Optimization of checkpointing-related I/O for high-performance parallel and distributed computing. In Journal of Supercomputing, 46(2):150--180, 2008.
  16. M. S. Bouguerra, T. Gautier, D. Trystram, and J. M. Vincent. A flexible checkpoint/restart model in distributed systems. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics (PPAM'10), pages 206--215, 2010.
  17. J. W. Young. A first order approximation to the optimum checkpoint interval. In Communications ACM, 17(9):530--531, 1974.
  18. P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pages 164--177, New York, NY, USA: ACM, 2003.
  19. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Commun. ACM, 51(1):107--113, 2008.
  20. P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Conference Series, 46(1):494, 2006.
  21. L. A. Barroso, J. Dean, and U. Holzle. Web search for a planet: The Google cluster architecture. In Journal of Micro, 23(2):22--28, 2003.
  22. L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik. Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression. In Proceedings of 24th International Conference on Neural Information Processing Systems (NIPS'10), pages 1--9, 2010.
  23. B. Nicolae and F. Cappello. BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), pages 34:1--34:12, 2011.
  24. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.
  25. S. Di and C. L. Wang. Error-Tolerant Resource Allocation and Payment Minimization for Cloud System. In IEEE Trans. on Parallel and Distributed Systems (TPDS), 24(6):1097--1106, 2013.
  26. Gideon-II Cluster: http://i.cs.hku.hk/~clwang/Gideon-II.
  27. C. H. C. Leung and Q. H. Choo. On the Execution of Large Batch Programs in Unreliable Computing Systems. In IEEE Trans. on Software Engineering, 10(4):444--450, 1984.
  28. K. Wolter. Stochastic models for checkpointing. In Stochastic Models for Fault Tolerance, pages 177--236, Springer Berlin Heidelberg, 2010.
  29. T. Nakagawa. Optimum retrial number of reliability models. In Advanced Reliability Models and Maintenance Policies, pages 101--122, ser. Springer Series in Reliability Engineering. Springer London, 2008.
  30. A. Tchana, L. Broto, and D. Hagimont. Fault Tolerant Approaches in Cloud Computing Infrastructures. In Proceedings of the 8th International Conference on Autonomic and Autonomous Systems (ICAS'12), pages 42--48, 2012.
  31. J. Barr, A. Narin, and J. Varia. Building Fault-Tolerant Applications on AWS. Tech. Rep., Oct 2011.

      Published in

        SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
        November 2013
        1123 pages
        ISBN: 9781450323789
        DOI: 10.1145/2503210
        General Chair: William Gropp
        Program Chair: Satoshi Matsuoka

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%
        Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
