PreFix: Switch Failure Prediction in Datacenter Networks

Published: 03 April 2018

Abstract

In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach: we predict failures so that operators can intervene and "fix" potential failures before they happen. Specifically, our proposed system, named PreFix, aims to determine at runtime whether a switch failure will happen in the near future. The prediction is based on measurements of the current switch system status and on historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share common syslog patterns before they occur, and that machine learning methods can extract these patterns to predict switch failures. Our novel set of features for machine learning (message template sequence, frequency, seasonality, and surge) efficiently handles the challenges of noise, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected over a 2-year period from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine. PreFix achieved an average recall of 61.81% and a false positive ratio of 1.84 × 10^-5. It outperforms other failure prediction methods designed for computers and ISP devices.
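
The two headline metrics can be made concrete with a short sketch. It assumes the standard definitions, recall = TP / (TP + FN) and false positive ratio = FP / (FP + TN); the abstract does not spell these out, so treat them as an assumption. The function names and the toy labels are hypothetical and are not data from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fn, fp, tn) for binary labels, where 1 = failure, 0 = healthy."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def recall(tp: int, fn: int) -> float:
    """Share of real failures that were predicted (assumed definition)."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_ratio(fp: int, tn: int) -> float:
    """Share of healthy samples wrongly flagged as failures (assumed definition)."""
    total = fp + tn
    return fp / total if total else 0.0


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(recall(tp, fn), false_positive_ratio(fp, tn))
```

Under these definitions, heavy sample imbalance (far more healthy samples than failures) makes TN very large, which is why a false positive ratio can be orders of magnitude smaller than recall and still correspond to a noticeable number of false alarms.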

