Abstract
Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge: the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by comparing the difference between an individual task's progress and the average task progress within a job against a predefined threshold. However, such a static threshold undermines speculation effectiveness, as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this article, we present an algorithm to improve the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold in the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach can reduce parallel job response time by up to 20% compared to the static threshold, and can achieve a higher speculation success rate: up to 66.67%, against 16.67% for the static method.
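The contrast between a static and a dynamic straggler threshold can be illustrated with a minimal Python sketch. It is not the algorithm from the article, whose formula the abstract does not state. The function names, the inputs `deadline_slack` and `utilization` (both assumed to lie in [0, 1]), the scaling rule, and the clamp bounds are all hypothetical and chosen only to show the idea that tighter timing constraints and idle resources can justify a stricter threshold.

```python
from statistics import mean


def find_stragglers(progress, threshold):
    """Return indices of tasks whose progress lags the job average by more than threshold."""
    avg = mean(progress)
    return [i for i, p in enumerate(progress) if avg - p > threshold]


def dynamic_threshold(base, deadline_slack, utilization, lo=0.05, hi=0.5):
    """Hypothetical rule (an assumption, not the article's formula).

    deadline_slack: 0.0 means the QoS deadline is tight, 1.0 means ample slack.
    utilization:    0.0 means the cluster is idle, 1.0 means it is saturated.
    A tight deadline or an idle cluster lowers the threshold (more replicas);
    ample slack or a busy cluster raises it (fewer replicas).
    """
    scale = (0.5 + deadline_slack) * (0.5 + utilization)
    return min(hi, max(lo, base * scale))


# Progress scores (fraction complete) of five tasks in one job.
progress = [1.0, 0.95, 0.9, 0.7, 0.55]

static = 0.2
strict = dynamic_threshold(static, deadline_slack=0.1, utilization=0.2)
relaxed = dynamic_threshold(static, deadline_slack=1.0, utilization=1.0)

print(find_stragglers(progress, static))   # [4]
print(find_stragglers(progress, strict))   # [3, 4]
print(find_stragglers(progress, relaxed))  # []
```

With the same progress scores, the static threshold flags only the slowest task, the stricter threshold also flags the second slowest, and the relaxed threshold flags none, which is the kind of per-application adaptivity the abstract describes.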
Index Terms
Adaptive Speculation for Efficient Internetware Application Execution in Clouds