Abstract
Today's cloud services rely extensively on replication techniques to ensure availability and reliability. In complex datacenter network architectures, however, seemingly independent replica servers may inadvertently share deep dependencies (e.g., aggregation switches). Such unexpected common dependencies can result in correlated failures across entire replication deployments, defeating the purpose of replication. Although existing cloud management and diagnosis tools offer post-failure forensics, they typically lead to prolonged failure recovery times in cloud-scale systems. In this paper, we propose a novel language framework, named RepAudit, that prevents correlated-failure risks before service outages occur by allowing cloud administrators to proactively audit the replication deployments of interest. In particular, RepAudit consists of three new components: 1) a declarative domain-specific language, RAL, for cloud administrators to write auditing programs expressing diverse auditing tasks; 2) a high-performance RAL auditing engine that generates the auditing results by accurately and efficiently analyzing the underlying structures of the target replication deployments; and 3) an RAL-code generator that can automatically produce complex RAL programs based on easily written specifications. Our evaluation shows that RepAudit uses 80x fewer lines of code than state-of-the-art efforts to express the auditing task of determining the top-20 critical correlated-failure root causes. To the best of our knowledge, RepAudit is the first effort capable of simultaneously offering expressive, accurate, and efficient correlated-failure auditing for cloud-scale replication systems.
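To make the notion of a shared deep dependency concrete, the following is a minimal Python sketch. It is not RAL and not RepAudit's auditing engine. The dependency map, the component names, and both helper functions are hypothetical, and the sketch assumes every listed dependency is required (no redundant paths). It only illustrates how two replicas on different top-of-rack switches can still share an aggregation switch.

```python
from typing import Dict, List, Set

# Hypothetical dependency map (illustration only): each component lists
# the components it requires in order to stay available.
DEPENDS_ON: Dict[str, List[str]] = {
    "replica-a": ["tor-1"],
    "replica-b": ["tor-2"],
    "tor-1": ["agg-1"],
    "tor-2": ["agg-1"],
    "agg-1": ["core-1"],
    "core-1": [],
}


def transitive_dependencies(node: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every component that `node` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def common_dependencies(replicas: List[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Return the components that all given replicas depend on.

    Under the stated assumption that every dependency is required,
    the failure of any one of these components takes down all replicas.
    """
    closures = [transitive_dependencies(r, graph) for r in replicas]
    return set.intersection(*closures) if closures else set()


shared = common_dependencies(["replica-a", "replica-b"], DEPENDS_ON)
print(sorted(shared))  # ['agg-1', 'core-1']
```

In this toy topology the two replicas look independent at the rack level, yet both fail if `agg-1` or `core-1` fails. A real audit would also have to account for redundant paths, which this sketch deliberately ignores.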