Abstract
Cloud environments offer low-cost computing resources as a subscription-based service. These resources are elastically scalable and dynamically provisioned. Cloud providers have also pioneered cost-effective pricing models such as spot instances. As a result, scientific workflows are increasingly adopting cloud computing. However, spot instances are terminated when the market price exceeds the user's bid price. Moreover, the cloud is not a utopian environment: failures are inevitable in such large, complex distributed systems, and it is well documented that the delivered performance of cloud resources fluctuates. These challenges make fault tolerance an important criterion in workflow scheduling. This article presents an adaptive, just-in-time scheduling algorithm for scientific workflows. The algorithm judiciously uses both spot and on-demand instances to reduce cost and provide fault tolerance. It also consolidates resources to further minimize execution time and cost. Extensive simulations show that the proposed heuristics are fault tolerant and effective, especially under short deadlines, providing robust schedules with minimal makespan and cost.
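The abstract does not spell out the scheduling rule, so the following is only an illustrative sketch of how a just-in-time choice between spot and on-demand instances might look. The termination test (market price above bid price) comes from the abstract; everything else is an assumption made for illustration, including the `Task` fields, the `choose_instance` helper, and the "enough slack for one on-demand retry" criterion. It is not the article's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description (field names are assumptions)."""
    runtime: float   # estimated execution time, in seconds
    deadline: float  # time remaining until the deadline, in seconds


def spot_is_terminated(market_price: float, bid_price: float) -> bool:
    """A spot instance is lost when the market price exceeds the bid."""
    return market_price > bid_price


def choose_instance(task: Task, market_price: float, bid_price: float,
                    on_demand_price: float) -> str:
    """Pick 'spot' or 'on-demand' for one task at dispatch time.

    Assumed rule: use spot only if the bid currently holds, spot is
    cheaper than on-demand, and the slack is large enough to rerun the
    task once on an on-demand instance after a spot failure.
    """
    slack = task.deadline - task.runtime
    can_retry = slack >= task.runtime
    bid_holds = not spot_is_terminated(market_price, bid_price)
    if bid_holds and can_retry and market_price < on_demand_price:
        return "spot"
    return "on-demand"
```

Under this assumed rule, a task with ample slack is sent to a cheap spot instance, while a task close to its deadline goes straight to on-demand, which loosely reflects the abstract's point that fault tolerance matters most under short deadlines.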
Enhancing Reliability of Workflow Execution Using Task Replication and Spot Instances