Abstract
This article describes an architecture that allows a replicated service to survive crashes without breaking its TCP connections. Our approach does not require modifications to the TCP protocol, to the operating system on the server, or to any of the software running on the clients. Furthermore, it runs on commodity hardware. We compare two implementations of this architecture (one based on primary/backup replication and another based on message logging), focusing on scalability, failover time, and application transparency. We evaluate three types of services: a file server, a Web server, and a multimedia streaming server. Our experiments suggest that the approach incurs low overhead on throughput, scales well as the number of clients increases, and allows recovery of the service in near-optimal time.
Practical and low-overhead masking of failures of TCP-based servers