Abstract
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google’s Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use—in part, because they require costly replication protocols, like Paxos, for fault tolerance. In this article, we present a new approach that makes transactional storage systems more affordable: We eliminate consistency from the replication protocol, while still providing distributed transactions with strong consistency to applications.
We present the Transactional Application Protocol for Inconsistent Replication (TAPIR), the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and better throughput.
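As an illustration of the single-round-trip claim, the following is a minimal sketch, not TAPIR's actual implementation. It assumes a group of 2f + 1 replicas, each of which validates a prepare request on its own without agreeing on an order with its peers. The client treats the outcome as decided after one round trip only if a fast quorum of ⌈3f/2⌉ + 1 replicas return the same reply. The names `Replica`, `client_prepare`, and `quorum_sizes`, and the simplified optimistic check, are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import ceil


def quorum_sizes(f):
    """Group sizes for tolerating f crash failures (assumes n = 2f + 1)."""
    n = 2 * f + 1
    return n, f + 1, ceil(3 * f / 2) + 1  # replicas, slow quorum, fast quorum


class Replica:
    """Hypothetical replica that validates locally, without coordination."""

    def __init__(self):
        self.versions = {}     # key -> last committed version number
        self.prepared = set()  # keys written by prepared, uncommitted txns

    def prepare(self, read_set, write_set):
        # Simplified optimistic check: every version read must still be
        # current, and no touched key may belong to another prepared txn.
        for key, version in read_set.items():
            if self.versions.get(key, 0) != version or key in self.prepared:
                return "ABORT"
        if any(key in self.prepared for key in write_set):
            return "ABORT"
        self.prepared.update(write_set)
        return "PREPARE-OK"


def client_prepare(replicas, read_set, write_set, f):
    """Send one round of prepares and return (outcome, path)."""
    _, _, fast = quorum_sizes(f)
    replies = Counter(r.prepare(read_set, write_set) for r in replicas)
    result, count = replies.most_common(1)[0]
    if count >= fast:
        return result, "fast"  # enough matching replies: one round trip
    return None, "slow"        # replicas disagree: another round is needed
```

With f = 1 the group has three replicas and the fast quorum is all three, so a single divergent replica forces the slower path. The sketch omits commit, abort, timestamps, and recovery entirely.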
Building Consistent Transactions with Inconsistent Replication