Abstract
Online software upgrades are often plagued by runtime behaviors that are poorly understood and difficult to ascertain. For example, the interactions among multiple versions of the software expose the system to race conditions that can introduce latent errors or data corruption. Moreover, industry trends suggest that online upgrades are currently needed in large-scale enterprise systems, which often span multiple administrative domains (e.g., Web 2.0 applications that rely on AJAX client-side code or systems that lease cloud-computing resources). In such systems, the enterprise does not control all the tiers of the system and cannot coordinate the upgrade process, making existing techniques inadequate to prevent mixed-version races. In this paper, we present an analytical framework for impact assessment, which allows system administrators to directly compare the risk of following an online-upgrade plan with the risk of delaying or canceling the upgrade. We also describe an executable model that implements our formal impact assessment and enables a systematic approach for deciding whether an online upgrade is appropriate. Our model provides a method of last resort for avoiding undesirable program behaviors, in situations where mixed-version races cannot be avoided through other technical means.
To upgrade or not to upgrade: impact of online upgrades across multiple administrative domains
OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applications