Abstract
Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains difficult. Existing testing techniques, such as simulation or running smaller instances of a service, are limited in their ability to predict overall service behavior at such scales.
Testing large services should ideally be done at the same scale and configuration as the target deployment, but doing so can be technically and economically infeasible. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines (VMs) across a much smaller number of physical machines in a test harness. We show how to accurately scale CPU, network, and disk to provide the illusion that each VM matches a machine in the original service in terms of both available computing resources and communication behavior. We present the architecture and evaluation of a system we built to support such experimentation and discuss its limitations. We show that for a variety of services (including a commercial high-performance cluster-based file system) and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.
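The resource arithmetic behind this kind of multiplexing can be illustrated with a minimal sketch. It is not the authors' implementation. The names `Resources`, `hosts_needed`, `raw_share`, and `perceived` are hypothetical, as are the example capacities. The sketch also assumes that each VM's perception of time is slowed by a factor equal to the number of VMs per host, so that a 1/k share of a physical resource appears as a full resource inside the VM; the abstract itself does not state the mechanism.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Per-machine capacities (illustrative units, not from the paper)."""
    cpu_ghz: float
    net_mbps: float
    disk_mbps: float


def hosts_needed(num_nodes: int, vms_per_host: int) -> int:
    """Physical hosts required to multiplex num_nodes service nodes as VMs."""
    return math.ceil(num_nodes / vms_per_host)


def raw_share(host: Resources, vms_per_host: int) -> Resources:
    """Real-time share of each host resource given an equal split among VMs."""
    return Resources(
        host.cpu_ghz / vms_per_host,
        host.net_mbps / vms_per_host,
        host.disk_mbps / vms_per_host,
    )


def perceived(raw: Resources, dilation: float) -> Resources:
    """Capacity a VM would observe if its clock ran `dilation` times slower.

    Assumption: slowing the guest's notion of time by `dilation` makes every
    rate (cycles/s, bits/s, bytes/s) look `dilation` times larger to the guest.
    """
    return Resources(
        raw.cpu_ghz * dilation,
        raw.net_mbps * dilation,
        raw.disk_mbps * dilation,
    )


if __name__ == "__main__":
    host = Resources(cpu_ghz=2.0, net_mbps=1000.0, disk_mbps=80.0)
    k = 10  # hypothetical number of VMs per physical host
    print(hosts_needed(1000, k))
    print(raw_share(host, k))
    print(perceived(raw_share(host, k), dilation=k))
```

Under these assumptions, a 1000-node service fits on 100 physical hosts at ten VMs per host, and each VM observes capacities matching an unshared machine, at the cost of the test running more slowly in wall-clock time.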
DieCast: Testing Distributed Systems with an Accurate Scale Model