Abstract
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach: we predict failures so that operators can intervene and "fix" potential failures before they happen. Specifically, our proposed system, named PreFix, aims to determine at runtime whether a switch failure will happen in the near future. The prediction is based on measurements of the current switch system status and on historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share common syslog patterns before the failures occur, and that machine learning methods can extract these common patterns to predict switch failures. Our novel set of features for machine learning (message template sequence, frequency, seasonality, and surge) efficiently deals with the challenges of noise, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected over a 2-year period from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine. On average, PreFix achieved 61.81% recall with a false positive ratio of 1.84 × 10^-5. It outperforms other failure prediction methods designed for computers and ISP devices.
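The two reported metrics can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions (recall = TP / (TP + FN), false positive ratio = FP / (FP + TN)); the abstract does not spell out how PreFix counts individual predictions, so the function names and the counts below are hypothetical and serve only to show the arithmetic.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual failures that were predicted (TP / (TP + FN))."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_ratio(fp: int, tn: int) -> float:
    """Fraction of non-failure cases flagged as failures (FP / (FP + TN))."""
    total = fp + tn
    return fp / total if total else 0.0


# Hypothetical counts, NOT taken from the paper: 10 real failures,
# 6 of them predicted, and 2 false alarms among 100,000 healthy cases.
tp, fn, fp, tn = 6, 4, 2, 99_998

print(f"recall = {recall(tp, fn):.2%}")
print(f"false positive ratio = {false_positive_ratio(fp, tn):.2e}")
```

Because healthy cases vastly outnumber failures, even a very small false positive ratio can correspond to a noticeable number of false alarms, which is why the two metrics are reported together.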
PreFix: Switch Failure Prediction in Datacenter Networks