Hound: Causal Learning for Datacenter-scale Straggler Diagnosis

Published: 03 April 2018

Abstract

Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. A large body of research has focused on mitigating datacenter stragglers, but relatively little research has focused on systematically and rigorously identifying their root causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, interpretable models, unbiased inference, and computational efficiency. We demonstrate Hound's capabilities for a production trace from Google's warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.
