ABSTRACT
In this paper, we optimize fault-tolerance techniques based on a checkpoint/restart mechanism in the context of cloud computing. Our contribution is threefold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is generic, making no assumption about the failure probability distribution, and simple to apply in practice. (2) We design an adaptive algorithm that optimizes the impact of checkpointing with respect to various costs, such as checkpointing and restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart (BLCR) tool. Task failure events are emulated using a production trace from a large-scale Google data center. Experiments confirm that our solution suits Google systems fairly well. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
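For context, the baseline named in the abstract, Young's formula, is the classic first-order estimate of the optimum checkpoint interval, sqrt(2 * C * MTBF), where C is the cost of one checkpoint and MTBF is the mean time between failures. The sketch below illustrates that baseline only; it is not the formula derived in this paper. The function names and the numeric values are hypothetical and chosen for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(job_length_s: float, interval_s: float) -> int:
    """Checkpoints taken inside a job when checkpointing every interval_s
    seconds of work (no checkpoint is needed at the very end of the job)."""
    if job_length_s <= 0 or interval_s <= 0:
        raise ValueError("job length and interval must be positive")
    return max(0, math.ceil(job_length_s / interval_s) - 1)


# Hypothetical inputs: 2 s per checkpoint, one failure per hour on average.
interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
count = checkpoint_count(job_length_s=1000.0, interval_s=interval)
print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

With these assumed inputs the interval is 120 seconds, so a 1000-second job takes 8 checkpoints. Young's estimate depends only on a mean failure rate, which is the kind of distributional assumption the abstract says the new analysis avoids.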
REFERENCES
- M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, Feb 2009.
- J. E. Smith and R. Nair. Virtual Machines: Versatile Platforms For Systems And Processes. Morgan Kaufmann, 2005.
- D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat. Enforcing performance isolation across virtual machines in Xen. In Proceedings of 7th ACM/IFIP/USENIX Int'l Conf. on Middleware (Middleware'06), pages 342--362, 2006.
- C. Evangelinos and C. N. Hill. Cloud Computing for parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2. In Computability and Complexity in Analysis (CAA'08), 2008.
- Amazon elastic compute cloud: online at http://aws.amazon.com/ec2/.
- D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. Eucalyptus: an open-source cloud computing infrastructure. In Journal of Physics: Conference Series, 180(1):1--14, 2009.
- J. N. Glosli, K. J. Caspersen, J. A. Gunnels, D. F. Richards, R. E. Rudd, and F. H. Streitz. Extending Stability Beyond CPU Millennium: A Micron-Scale Atomistic Simulation of Kelvin-Helmholtz Instability. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC'07), pages 58:1--58:11, 2007.
- J. Wilkes. More Google cluster data. Google research blog, Nov. 2011, posted at http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html.
- C. Reiss, J. Wilkes, and J. L. Hellerstein. Google cluster-usage traces: format + schema. Google Inc., Mountain View, CA, USA, Technical Report, Nov. 2011, revised 2012.03.20.
- C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis. Intel science and technology center for cloud computing, Carnegie Mellon University, Pittsburgh, PA, USA, Tech. Rep. ISTC-CC-TR-12-101, Apr. 2012.
- S. Di, D. Kondo, and W. Cirne. Characterization and comparison of cloud versus grid workloads. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster'12), pages 230--238, 2012.
- S. Yi, A. Andrzejak, and D. Kondo. Monetary Cost-Aware Checkpointing and Migration on Amazon Cloud Spot Instances. In IEEE Trans. on Services Computing, 5(4):512--524, 2012.
- W. Cirne, G. Chaudhry, and S. Johnson. Managing Descheduling Risk in the Google Cloud. Posted at http://cloud.berkeley.edu/data/managing-descheduling-risk-in-the-google-cloud-berkeley.pdf.
- J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. In Future Generation Computer Systems, 22(3):303--312, 2006.
- R. Subramaniyan, E. Grobelny, S. Studham, and A. George. Optimization of checkpointing-related I/O for high-performance parallel and distributed computing. In Journal of Supercomputing, 46(2):150--180, 2008.
- M. S. Bouguerra, T. Gautier, D. Trystram, and J. M. Vincent. A flexible checkpoint/restart model in distributed systems. In Proceedings of the 8th international conference on Parallel processing and applied mathematics (PPAM'10), pages 206--215, 2010.
- J. W. Young. A first order approximation to the optimum checkpoint interval. In Communications ACM, 17(9):530--531, 1974.
- P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM symposium on Operating systems principles (SOSP '03), pages 164--177, New York, NY, USA: ACM, 2003.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Commun. ACM, 51(1):107--113, 2008.
- P. H. Hargrove and J. C. Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. In Journal of Physics: Conference Series, 46(1):494, 2006.
- L. A. Barroso, J. Dean, and U. Holzle. Web search for a planet: The Google cluster architecture. In Journal of Micro, 23(2):22--28, 2003.
- L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik. Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression. In Proceedings of 24th International Conference on Neural Information Processing Systems (NIPS'10), pages 1--9, 2010.
- B. Nicolae and F. Cappello. BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), pages 34:1--34:12, 2011.
- S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.
- S. Di and C. L. Wang. Error-Tolerant Resource Allocation and Payment Minimization for Cloud System. In IEEE Trans. on Parallel and Distributed Systems (TPDS), 24(6):1097--1106, 2013.
- Gideon-II Cluster: http://i.cs.hku.hk/~clwang/Gideon-II.
- C. H. C. Leung and Q. H. Choo. On the Execution of Large Batch Programs in Unreliable Computing Systems. In IEEE Trans. on Software Engineering, 10(4):444--450, 1984.
- K. Wolter. Stochastic models for checkpointing. In Stochastic Models for Fault Tolerance, pages 177--236, Springer Berlin Heidelberg, 2010.
- T. Nakagawa. Optimum retrial number of reliability models. In Advanced Reliability Models and Maintenance Policies, pages 101--122, ser. Springer Series in Reliability Engineering. Springer London, 2008.
- A. Tchana, L. Broto, and D. Hagimont. Fault Tolerant Approaches in Cloud Computing Infrastructures. In Proceedings of the 8th International Conference on Autonomic and Autonomous Systems (ICAS'12), pages 42--48, 2012.
- J. Barr, A. Narin, and J. Varia. Building Fault-Tolerant Applications on AWS. Tech. Rep., Oct 2011.
Index Terms
Optimization of cloud task processing with checkpoint-restart mechanism