research-article
Open Access
Artifacts Available
Artifacts Evaluated & Functional

An auditing language for preventing correlated failures in the cloud

Published: 12 October 2017

Abstract

Today's cloud services rely extensively on replication to ensure availability and reliability. In complex datacenter network architectures, however, seemingly independent replica servers may inadvertently share deep dependencies (e.g., aggregation switches). Such unexpected common dependencies can cause correlated failures across an entire replication deployment, defeating the purpose of replication. Existing cloud management and diagnosis tools offer post-failure forensics, but they typically lead to prolonged failure recovery times in cloud-scale systems. In this paper, we propose a novel language framework, named RepAudit, that prevents correlated failure risks before service outages occur by allowing cloud administrators to proactively audit the replication deployments of interest. RepAudit consists of three new components: 1) a declarative domain-specific language, RAL, in which cloud administrators write auditing programs expressing diverse auditing tasks; 2) a high-performance RAL auditing engine that generates auditing results by accurately and efficiently analyzing the underlying structures of the target replication deployments; and 3) an RAL-code generator that automatically produces complex RAL programs from easily written specifications. Our evaluation shows that RepAudit uses 80x fewer lines of code than state-of-the-art efforts to express the auditing task of determining the top-20 critical correlated-failure root causes. To the best of our knowledge, RepAudit is the first effort capable of simultaneously offering expressive, accurate, and efficient correlated-failure auditing for cloud-scale replication systems.
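
To make the notion of a shared deep dependency concrete, the following is a minimal Python sketch, not RAL and not RepAudit's auditing engine. It assumes a hypothetical dependency map (`DEPENDS_ON`) in which a component fails if anything it depends on fails, and it reports the components that every replica transitively relies on. All component names and both helper functions are illustrative assumptions.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it directly relies on.
# The names are made up for illustration and do not come from the paper.
DEPENDS_ON: Dict[str, List[str]] = {
    "replica-a": ["tor-switch-1"],
    "replica-b": ["tor-switch-2"],
    "tor-switch-1": ["agg-switch-1"],
    "tor-switch-2": ["agg-switch-1"],
    "agg-switch-1": [],
}


def transitive_dependencies(node: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every component reachable from `node` by following dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def shared_dependencies(replicas: List[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Return components that all given replicas depend on, directly or indirectly."""
    closures = [transitive_dependencies(r, graph) for r in replicas]
    return set.intersection(*closures) if closures else set()


if __name__ == "__main__":
    # Both replicas sit behind different top-of-rack switches,
    # yet they share one aggregation switch.
    print(shared_dependencies(["replica-a", "replica-b"], DEPENDS_ON))
```

Under this simplified model, the two replicas look independent at the first hop but share `agg-switch-1`, so a failure there would take down both. Real deployments also contain redundant paths, which a plain intersection like this does not account for.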
