Enhancing Reliability of Workflow Execution Using Task Replication and Spot Instances

Published: 03 February 2016

Abstract

Cloud environments offer low-cost computing resources as a subscription-based service. These resources are elastically scalable and dynamically provisioned. Cloud providers have also pioneered cost-effective pricing models such as spot instances. As a result, scientific workflows are increasingly adopting cloud computing. However, spot instances are terminated when the market price exceeds the user's bid price. Moreover, the cloud is not a utopian environment: failures are inevitable in such large, complex distributed systems, and it is well documented that cloud resources experience fluctuations in delivered performance. These challenges make fault tolerance an important criterion in workflow scheduling. This article presents an adaptive, just-in-time scheduling algorithm for scientific workflows. The algorithm judiciously uses both spot and on-demand instances to reduce cost and provide fault tolerance. It also consolidates resources to further minimize execution time and cost. Extensive simulations show that the proposed heuristics are fault tolerant and effective, especially under short deadlines, providing robust schedules with minimal makespan and cost.
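
The spot termination rule stated in the abstract, and the general idea of mixing spot and on-demand instances under a deadline, can be illustrated with a minimal sketch. This is not the article's algorithm: the `Task` fields, the `choose_instance` helper, and the `slack_factor` threshold are hypothetical names and assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runtime: float   # estimated run time in hours (hypothetical field)
    deadline: float  # latest allowed finish time, in hours (hypothetical field)


def spot_terminated(market_price: float, bid_price: float) -> bool:
    """A spot instance is lost once the market price exceeds the user's bid."""
    return market_price > bid_price


def choose_instance(task: Task, now: float, market_price: float,
                    bid_price: float, slack_factor: float = 2.0) -> str:
    """Illustrative rule only: use a spot instance when the bid currently
    holds and there is enough slack to rerun the task if the instance is
    lost; otherwise fall back to an on-demand instance."""
    if spot_terminated(market_price, bid_price):
        return "on-demand"
    slack = task.deadline - now
    # Assumed threshold: slack must cover slack_factor runs of the task.
    if slack >= slack_factor * task.runtime:
        return "spot"
    return "on-demand"


if __name__ == "__main__":
    task = Task(name="example", runtime=1.0, deadline=5.0)
    print(choose_instance(task, now=0.0, market_price=0.05, bid_price=0.10))
    print(choose_instance(task, now=4.0, market_price=0.05, bid_price=0.10))
```

Under these assumptions, a task with ample slack is placed on a cheaper spot instance, while a task close to its deadline is placed on an on-demand instance, which the price rule cannot terminate.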

Published in

ACM Transactions on Autonomous and Adaptive Systems, Volume 10, Issue 4
Special Section on Best Papers from SEAMS 2014 and Regular Articles
February 2016
211 pages
ISSN: 1556-4665
EISSN: 1556-4703
DOI: 10.1145/2872308

Copyright © 2016 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 3 February 2016
• Accepted: 1 August 2015
• Revised: 1 April 2015
• Received: 1 December 2014

Qualifiers

• research-article
• Research
• Refereed
