The costs and limits of availability for replicated services

Published: 01 February 2006
Abstract

As raw system performance continues to improve at exponential rates, the utility of many services is increasingly limited by availability rather than performance. A key approach to improving availability involves replicating the service across multiple, wide-area sites. However, replication introduces well-known trade-offs between service consistency and availability. This article therefore explores the benefits of dynamically trading consistency for availability using a continuous consistency model. In this model, applications specify a maximum deviation from strong consistency on a per-replica basis. In this article, we: i) evaluate the availability of a prototype replication system running across the Internet as a function of consistency level, consistency protocol, and failure characteristics, ii) demonstrate that simple optimizations to existing consistency protocols result in significant availability improvements (more than an order of magnitude in some scenarios), iii) use our experience with these optimizations to prove a tight upper bound on the availability of services, and iv) show that maximizing availability typically entails remaining as close to strong consistency as possible during times of good connectivity, resulting in a communication versus availability trade-off.
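
The per-replica bound described in the abstract can be illustrated with a minimal sketch. It is not the authors' prototype. It assumes the three consistency metrics named in the review on this page (numerical error, order error, and staleness), and every class, field, and function name below is hypothetical. It also treats the amount of unseen remote work as a known input, which a real protocol would have to enforce through communication between replicas.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Maximum deviation from strong consistency allowed at one replica.

    All names are hypothetical and chosen for illustration only.
    """
    max_numerical_error: float  # weight of remote writes not yet seen locally
    max_order_error: int        # local writes that are still tentative
    max_staleness: float        # seconds since the last synchronization


@dataclass
class ReplicaState:
    """Current deviation of one replica (assumed to be known for this sketch)."""
    unseen_remote_weight: float
    tentative_writes: int
    seconds_since_sync: float


def can_serve_locally(state: ReplicaState, bounds: Bounds) -> bool:
    """Return True if the replica is within all three bounds.

    A replica within its bounds may answer without contacting other
    replicas; otherwise it must synchronize first, and during a network
    partition that means the request is rejected (the service is
    unavailable to that client).
    """
    return (
        state.unseen_remote_weight <= bounds.max_numerical_error
        and state.tentative_writes <= bounds.max_order_error
        and state.seconds_since_sync <= bounds.max_staleness
    )


# Strong consistency is the special case in which every bound is zero.
STRONG = Bounds(max_numerical_error=0.0, max_order_error=0, max_staleness=0.0)
RELAXED = Bounds(max_numerical_error=5.0, max_order_error=3, max_staleness=30.0)

# A replica that has been cut off from its peers for 12 seconds.
partitioned = ReplicaState(
    unseen_remote_weight=2.0, tentative_writes=1, seconds_since_sync=12.0
)

print(can_serve_locally(partitioned, STRONG))   # False
print(can_serve_locally(partitioned, RELAXED))  # True
```

Under these assumptions, the same partitioned replica must reject requests when strong consistency is required but can keep serving under the relaxed bounds, which is the consistency-for-availability trade the abstract refers to.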

      Reviews

      Michael Zastre

      Replication is usually the solution touted when addressing the problem of ensuring highly available services. This technique does not come without a cost, however; availability must be balanced with replica consistency. These tradeoffs are well known. The consequence of dynamically adjusting the degree of consistency on system availability (where consistency is a continuous function) is not as well known. This paper explores this area by presenting some theoretical results on availability, along with test results, and does this by taking the point of view that the biggest impact on availability is network reachability, rather than system uptime.

      Sections 1 and 2 provide a description of the work's scope, along with a listing of three important metrics for consistency (numerical error, order error, and staleness). Section 3 tackles the difficult problem of specifying a system model, and precisely defining availability. Section 4 develops the concept of an availability upper bound, the best possible availability of a service given specific constraints on consistency among service replicas. Perfect consistency implies lower service availability than is possible when consistency constraints are relaxed, and the upper bound is an attempt to explore the limits imposed on availability by the constraints. Sections 5 and 6 describe an experimental implementation and results. These results are obtained via network emulation, and are validated using an eight-node wide-area network (WAN). Sections 7 and 8 round out the paper with related research, along with conclusions and descriptions of tantalizing future work, an example of the latter being an application of the work to mobile networks.

      This is not an easy paper to read without some prior knowledge of the theory of distributed replicas. The paper is well written; the authors simply (and justifiably) assume a certain amount of background, and so even those with a bit of background in distributed systems may find themselves slowing down in places.

      Online Computing Reviews Service
