research-article

Fault Monitoring with Sequential Matrix Factorization

Published: 08 October 2015

Abstract

For real-world distributed systems, the knowledge component at the core of the MAPE-K loop must be inferred, because it cannot realistically be assumed to be defined a priori. Accordingly, this paper treats fault monitoring as a latent factor discovery problem. In the context of end-to-end probing, the goal is to devise an efficient sampling policy that makes the best use of a constrained sampling budget.

Previous work addresses fault monitoring in a collaborative prediction framework, where the information is a snapshot of the probe outcomes. Here, we take into account the fact that the system evolves dynamically at various time scales. We propose and evaluate Sequential Matrix Factorization (SMF), which exploits both recent advances in matrix factorization for the instantaneous information and a new sampling heuristic based on historical information. The effectiveness of the SMF approach is demonstrated on datasets of increasing difficulty and compared with state-of-the-art history-based and snapshot-based methods. In all cases, strong adaptivity under the specific flavor of active learning is required to unleash the full potential of coupling the most confident and the most uncertain sampling heuristics, which is the cornerstone of SMF.
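The abstract does not spell out the SMF algorithm, so the following is only a minimal sketch of the general idea it describes: predict unprobed outcomes with a low-rank factorization, then spend a fixed probing budget on a mix of the most uncertain and the most confident predictions. Everything here is an assumption made for illustration, not the authors' method: the function names `factorize` and `select_probes`, the plain squared-loss gradient descent (standing in for the matrix factorization techniques the paper builds on), the use of the absolute predicted score as a confidence margin, and the fixed 50/50 split controlled by `uncertain_fraction`. The historical, time-dependent part of SMF is not modeled.

```python
import numpy as np

def factorize(observed, mask, rank=2, steps=500, lr=0.05, reg=0.01, seed=0):
    """Fit observed ~ U @ V.T on the masked entries (assumed squared loss)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = observed.shape
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    for _ in range(steps):
        # Residual is counted only where a probe outcome is known.
        err = mask * (U @ V.T - observed)
        U, V = U - lr * (err @ V + reg * U), V - lr * (err.T @ U + reg * V)
    return U @ V.T

def select_probes(scores, mask, budget, uncertain_fraction=0.5):
    """Choose unobserved entries to probe next under a fixed budget.

    Part of the budget goes to the smallest |score| (most uncertain),
    the rest to the largest |score| (most confident). The split is a
    hypothetical parameter, not taken from the paper.
    """
    candidates = np.argwhere(~mask)      # (row, col) of unprobed pairs
    margins = np.abs(scores[~mask])      # same row-major order as candidates
    order = np.argsort(margins)
    budget = min(budget, len(candidates))
    n_uncertain = int(round(budget * uncertain_fraction))
    n_confident = budget - n_uncertain
    chosen = list(order[:n_uncertain])
    if n_confident:
        chosen += list(order[len(order) - n_confident:])
    return [(int(candidates[k][0]), int(candidates[k][1])) for k in chosen]

# Toy probe matrix: +1 means the probe succeeded, -1 means it failed.
outcomes = np.outer([1, 1, -1, -1, 1, 1], [1, -1, 1, 1, -1]).astype(float)
known = np.ones(outcomes.shape, dtype=bool)
known[::2, ::2] = False                  # pretend these pairs were never probed
predicted = factorize(outcomes * known, known)
next_probes = select_probes(predicted, known, budget=4)
```

In a sequential setting, the selected probes would be launched, their outcomes written back into the observed matrix, and the factorization refitted at the next time step.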

Published in

ACM Transactions on Autonomous and Adaptive Systems, Volume 10, Issue 3
October 2015
204 pages
ISSN: 1556-4665
EISSN: 1556-4703
DOI: 10.1145/2819320

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 8 October 2015
• Accepted: 1 June 2015
• Revised: 1 February 2015
• Received: 1 May 2014

        Qualifiers

        • research-article
        • Research
        • Refereed
