Abstract
For real-world distributed systems, the knowledge component at the core of the MAPE-K loop has to be inferred, as it cannot realistically be assumed to be defined a priori. Accordingly, this paper treats fault monitoring as a latent factor discovery problem. In the context of end-to-end probing, the goal is to devise an efficient sampling policy that makes the best use of a constrained sampling budget.
Previous work addresses fault monitoring in a collaborative prediction framework, where the information is a snapshot of the probe outcomes. Here, we take into account the fact that the system evolves dynamically at various time scales. We propose and evaluate Sequential Matrix Factorization (SMF), which exploits both recent advances in matrix factorization for the instantaneous information and a new sampling heuristic based on historical information. The effectiveness of the SMF approach is demonstrated on datasets of increasing difficulty and compared with state-of-the-art history-based and snapshot-based methods. In all cases, strong adaptivity, in the specific form of active learning, is required to unleash the full potential of coupling the most confident and the most uncertain sampling heuristics, which is the cornerstone of SMF.
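To make the idea of coupling two sampling heuristics concrete, the following is a minimal sketch and not the SMF algorithm itself. It assumes probe outcomes are encoded as +1 (success) and -1 (failure), fits a low-rank model to the observed entries with regularized alternating least squares (a stand-in for the factorization method used in the paper), and then spends a fixed probing budget partly on the unobserved entries whose predicted score is closest to zero (most uncertain) and partly on those whose score is largest in magnitude (most confident). The function names `factorize` and `select_probes`, the parameter `uncertain_share`, and the toy data are all hypothetical.

```python
import numpy as np

def factorize(observed, mask, rank=2, reg=0.1, iters=20, seed=0):
    """Regularized alternating least squares fitted on observed entries only."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = observed.shape
    U = rng.normal(scale=0.1, size=(n_rows, rank))
    V = rng.normal(scale=0.1, size=(n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        for i in range(n_rows):          # update row factors
            cols = mask[i]
            if cols.any():
                Vi = V[cols]
                U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ observed[i, cols])
        for j in range(n_cols):          # update column factors
            rows = mask[:, j]
            if rows.any():
                Uj = U[rows]
                V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ observed[rows, j])
    return U @ V.T                       # real-valued score for every entry

def select_probes(scores, mask, budget, uncertain_share=0.5):
    """Split the budget between least- and most-confident unobserved entries."""
    candidates = np.flatnonzero(~mask.ravel())
    budget = min(budget, candidates.size)
    order = np.argsort(np.abs(scores.ravel()[candidates]))  # ascending |score|
    n_uncertain = int(round(budget * uncertain_share))
    n_confident = budget - n_uncertain
    chosen = np.concatenate([order[:n_uncertain],
                             order[order.size - n_confident:]])
    rows, cols = np.unravel_index(candidates[chosen], mask.shape)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Toy data (assumed): a rank-1 matrix of +1/-1 outcomes, about half observed.
rng = np.random.default_rng(1)
truth = np.outer(rng.choice([-1.0, 1.0], size=8), rng.choice([-1.0, 1.0], size=6))
mask = rng.random(truth.shape) < 0.5
predictions = factorize(truth * mask, mask)
picks = select_probes(predictions, mask, budget=4)
print(picks)
```

In a sequential setting, the selected probes would be launched, their outcomes added to the observed set, and the model refitted at the next time step. How the budget is split between the two heuristics is a design choice that the sketch fixes with a single constant.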
Index Terms
Fault Monitoring with Sequential Matrix Factorization