research-article

DieCast: Testing Distributed Systems with an Accurate Scale Model

Published: 1 May 2011

Abstract

Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior at such scales.

Testing large services should ideally be done at the same scale and configuration as the target deployment, which can be technically and economically infeasible. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines across a much smaller number of physical machines in a test harness. We show how to accurately scale CPU, network, and disk to provide the illusion that each VM matches a machine in the original service in terms of both available computing resources and communication behavior. We present the architecture and evaluation of a system we built to support such experimentation and discuss its limitations. We show that for a variety of services (including a commercial high-performance cluster-based file system) and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.
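
The multiplexing idea described above can be illustrated with a small sketch. It is not the DieCast implementation: the round-robin placement, the even split of host resources, and all names (`Resources`, `place_nodes`, `even_share`) are assumptions made for illustration only. The abstract does not say how the illusion of full-size machines is produced, so the sketch only shows the raw share of real hardware each VM would receive under an even split.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Capacity of one physical test machine (hypothetical units)."""
    cpu_mhz: float
    net_mbps: float
    disk_mbps: float


def place_nodes(num_nodes: int, num_hosts: int) -> dict[int, list[int]]:
    """Assign service nodes (as VMs) to physical hosts round-robin.

    Round-robin is an assumption of this sketch, not a policy taken
    from the paper.
    """
    if num_hosts <= 0:
        raise ValueError("need at least one physical host")
    if num_nodes < 0:
        raise ValueError("number of nodes cannot be negative")
    placement: dict[int, list[int]] = {host: [] for host in range(num_hosts)}
    for node in range(num_nodes):
        placement[node % num_hosts].append(node)
    return placement


def even_share(host: Resources, vms_on_host: int) -> Resources:
    """Real resources available to each VM if the host is split evenly."""
    if vms_on_host <= 0:
        raise ValueError("host must run at least one VM")
    return Resources(
        cpu_mhz=host.cpu_mhz / vms_on_host,
        net_mbps=host.net_mbps / vms_on_host,
        disk_mbps=host.disk_mbps / vms_on_host,
    )


if __name__ == "__main__":
    # Ten service nodes multiplexed onto three physical test machines.
    placement = place_nodes(num_nodes=10, num_hosts=3)
    host = Resources(cpu_mhz=3000.0, net_mbps=1000.0, disk_mbps=100.0)
    for host_id, nodes in placement.items():
        print(host_id, nodes, even_share(host, len(nodes)))
```

The gap between each VM's even share and the capacity of the machine it stands in for is what a scaling mechanism such as the one the abstract describes has to hide from the software under test.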

Published in

ACM Transactions on Computer Systems, Volume 29, Issue 2
May 2011
132 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/1963559

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 May 2011
• Revised: 1 December 2010
• Accepted: 1 December 2010
• Received: 1 May 2010

Qualifiers

• research-article
• Research
• Refereed
