research-article

Adaptive Speculation for Efficient Internetware Application Execution in Clouds

Published: 20 January 2018

Abstract

Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge: the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by comparing the difference between an individual task's progress and the average task progress within a job against a predefined threshold. However, such a static threshold debilitates speculation effectiveness, as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this article, we present an algorithm that improves the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold in the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach reduces parallel job response time by up to 20% compared to the static threshold and achieves a higher speculation success rate, up to 66.67% against 16.67% for the static method.
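As a rough illustration of the detection rule described in the abstract, the sketch below flags a task as a straggler when its progress trails the job average by more than a threshold. The function names, the sample progress values, and in particular the `adaptive_threshold` rule are assumptions made for illustration; the abstract does not give the article's actual formula, so this should not be read as the authors' algorithm.

```python
from statistics import mean


def find_stragglers(progress, threshold):
    """Return the IDs of tasks whose progress trails the job average
    by more than `threshold` (progress values are fractions in [0, 1])."""
    avg = mean(progress.values())
    return sorted(task for task, p in progress.items() if avg - p > threshold)


def adaptive_threshold(base, deadline_slack, utilization):
    """Hypothetical adjustment rule (NOT the article's formula).

    deadline_slack: fraction of the job's timing budget still available, in [0, 1].
    utilization:    fraction of cluster resources in use, in [0, 1].

    Little slack lowers the threshold, so more tasks qualify for replicas;
    high utilization raises it, so fewer replicas compete for resources.
    At slack = utilization = 0.5 the result equals `base`.
    """
    return base * (0.5 + deadline_slack) * (0.5 + utilization)


# Illustrative progress scores for four tasks of one job (assumed values).
progress = {"t1": 0.95, "t2": 0.90, "t3": 0.60, "t4": 0.45}

static = find_stragglers(progress, 0.2)
urgent = find_stragglers(progress, adaptive_threshold(0.2, 0.1, 0.5))
relaxed = find_stragglers(progress, adaptive_threshold(0.2, 0.9, 0.9))
print(static, urgent, relaxed)
```

With a fixed threshold, the same tasks are flagged regardless of how close the job is to its deadline or how busy the cluster is. Letting the threshold vary with those two inputs is one simple way to express the "different levels of strictness for replica creation" that the abstract refers to.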

Published in

ACM Transactions on Internet Technology, Volume 18, Issue 2
Special Issue on Internetware and Devops and Regular Papers
May 2018, 294 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3182619
Editor: Munindar P. Singh

Copyright © 2018 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 October 2016
• Revised: 1 May 2017
• Accepted: 1 May 2017
• Published: 20 January 2018

Qualifiers

• research-article
• Research
• Refereed
