Abstract
Stragglers are exceptionally slow tasks within a job that delay its completion. Although uncommon within a single job, stragglers are pervasive in datacenters that run many jobs. A large body of research has focused on mitigating datacenter stragglers, but relatively little has focused on systematically and rigorously identifying their root causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, interpretable models, unbiased inference, and computational efficiency. We demonstrate Hound's capabilities on a production trace from Google's warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.
Hound: Causal Learning for Datacenter-scale Straggler Diagnosis