Abstract
This paper describes the design and implementation of SecondSite, a cloud-based service for disaster tolerance. SecondSite extends the Remus virtualization-based high availability system by allowing groups of virtual machines to be replicated across data centers over wide-area Internet links. The goal of the system is to commodify the property of availability, exposing it as a simple tick box when configuring a new virtual machine. To achieve this in the wide area, we have had to tackle the related issues of replication traffic bandwidth, reliable failure detection across geographic regions, and traffic redirection over a wide-area network, all without compromising on transparency and consistency.
SecondSite: disaster tolerance as a service