Abstract
Cloud infrastructures provide a rich set of management tasks that operate computing, storage, and networking resources in the cloud. Monitoring the execution of these tasks is crucial for cloud providers to promptly find and understand problems that compromise cloud availability. However, such monitoring is challenging because each execution involves multiple distributed service components. CloudSeer enables effective workflow monitoring. It takes a lightweight, non-intrusive approach that works purely on the interleaved logs that already exist widely in cloud infrastructures. CloudSeer first builds an automaton for the workflow of each management task based on normal executions, and then checks log messages against the set of automata for workflow divergences in a streaming manner. Divergences found during checking indicate potential execution problems, which may or may not be accompanied by error log messages. For each potential problem, CloudSeer outputs the context needed for further diagnosis, including the affected task automaton and the related log messages that hint at where the problem occurs. Our experiments on OpenStack, a popular open-source cloud infrastructure, show that CloudSeer's efficiency and problem-detection capability are suitable for online monitoring.
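The two phases described in the abstract (build one automaton per task from normal executions, then check an interleaved log stream against the automata) can be illustrated with a minimal sketch. This is not CloudSeer's actual algorithm: it assumes each workflow is a strict linear sequence and that every log message already carries a task identifier, neither of which the abstract states. The names `TaskAutomaton`, `learn_linear_automaton`, and `check_stream`, as well as the event strings, are hypothetical and are not real OpenStack log messages.

```python
from dataclasses import dataclass, field


@dataclass
class TaskAutomaton:
    name: str
    start: str
    accept: str
    transitions: dict  # state -> {event: next_state}


@dataclass
class Instance:
    automaton: TaskAutomaton
    state: str
    history: list = field(default_factory=list)


def learn_linear_automaton(name, normal_run):
    """Build a linear automaton from the ordered events of one normal run."""
    transitions = {
        f"s{i}": {event: f"s{i + 1}"} for i, event in enumerate(normal_run)
    }
    return TaskAutomaton(name, "s0", f"s{len(normal_run)}", transitions)


def report(task_id, inst, reason, context):
    """Bundle the context a diagnosing operator would need."""
    return {
        "task_id": task_id,
        "automaton": inst.automaton.name if inst else None,
        "reason": reason,
        "context": context,
    }


def check_stream(automata, stream):
    """Check (task_id, event) pairs one at a time; return divergence reports."""
    active, reports = {}, []
    for task_id, event in stream:
        inst = active.get(task_id)
        if inst is None:
            # Assumption: a task starts with the first event of some automaton.
            match = next(
                (a for a in automata if event in a.transitions.get(a.start, {})),
                None,
            )
            if match is None:
                reports.append(report(task_id, None, "unexpected message", [event]))
                continue
            inst = active[task_id] = Instance(match, match.start)
        nxt = inst.automaton.transitions.get(inst.state, {}).get(event)
        if nxt is None:
            # No valid transition: the execution diverged from the workflow.
            reports.append(report(task_id, inst, "divergence", inst.history + [event]))
            del active[task_id]
            continue
        inst.state = nxt
        inst.history.append(event)
        if inst.state == inst.automaton.accept:
            del active[task_id]  # workflow completed normally
    # Instances that never reached the accepting state are also suspicious.
    for task_id, inst in active.items():
        reports.append(report(task_id, inst, "incomplete", list(inst.history)))
    return reports


# Hypothetical event names for two management tasks.
boot = learn_linear_automaton(
    "boot", ["api:accept", "scheduler:pick_host", "compute:spawn", "compute:active"]
)
delete = learn_linear_automaton("delete", ["api:delete", "compute:terminate"])

interleaved = [
    ("r1", "api:accept"),
    ("r2", "api:delete"),
    ("r1", "scheduler:pick_host"),
    ("r2", "compute:terminate"),
    ("r1", "compute:spawn"),  # "compute:active" never arrives for r1
]
for item in check_stream([boot, delete], interleaved):
    print(item)
```

In this sketch the stalled boot task is reported without any error message appearing in the stream, which mirrors the abstract's point that divergences may or may not be accompanied by error log messages. A streaming deployment would likely replace the end-of-stream check with a timeout; that choice is left open here.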
CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems