Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write

A distributed multi-writer multi-reader (MWMR) atomic register is an important primitive that enables a wide range of distributed algorithms. Hence, improving its performance can have large-scale consequences. Since the seminal work of ABD emulation in message-passing networks [JACM '95], many researchers have studied fast implementations of atomic registers under various conditions. "Fast" means that a read or a write can be completed in 1 round-trip time (RTT), by contacting a simple majority. In this work, we explore an atomic register with optimal resilience and "optimistically fast" read and write operations. That is, both operations can be fast if there is no concurrent write. This paper has three contributions: (i) We present Gus, the emulation of an MWMR atomic register with optimal resilience and optimistically fast reads and writes when there are up to 5 nodes; (ii) We show that when there are > 5 nodes, it is impossible to emulate an MWMR atomic register with both properties; and (iii) We implement Gus in the framework of EPaxos and Gryff, and show that Gus provides lower tail latency than state-of-the-art systems such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context of geo-replicated object storage systems.


INTRODUCTION
Attiya, Bar-Noy, and Dolev [7] present an emulation algorithm, namely ABD, that implements an atomic single-writer multi-reader register with optimal resilience in asynchronous message-passing networks where nodes may crash. ABD allows porting many known shared-memory algorithms to message-passing networks, such as multi-writer multi-reader (MWMR) registers, atomic snapshots, approximate consensus, and randomized consensus.
The MWMR version of ABD [42] requires 2 RTT to complete a write operation, and 1 RTT to complete a read when there is no concurrent write. Subsequent works identify conditions under which reads [20, 34] and writes [23] can be fast. An operation is fast if it can always be completed in 1 round-trip time (RTT), by contacting a simple majority of nodes. Unfortunately, the conditions for fast writes are not generally applicable to practical systems, as discussed in Section 2.2.
Dutta et al. [PODC '04] prove that implementing an atomic register with both fast writes and fast reads is impossible [20]. Recently, Huang et al. [PODC '20] identify more constraints on implementing fast writes or fast reads [34]. Motivated by these results, we ask: "Can we do better for practical systems?"

Motivation. Observe that object storage systems can be modeled as atomic registers. For real-world object storages, typical workloads have two key characteristics [4, 14, 24]: (i) Concurrency is rare, but possible: in Microsoft OneDrive, only 0.5% of the writes occur within a 1-second interval; and (ii) Object size and operation mix vary widely: IBM Cloud Object Storage supports hosting services for web pages, games, videos, and enterprise backups. In their testing benchmark [4], object size varies from 1 KB to 128 MB, and the ratio of write operations ranges from 5% to 90%.
These observations indicate that it is important to design an algorithm that handles various workloads efficiently for practical object storages. To optimize for the common case, we are interested in "optimistically fast" operations, i.e., operations that are fast when there is no concurrent write. The MWMR version of ABD [42] achieves optimistically fast reads, but not writes. Concretely, we answer the following question in this paper: Is it possible to implement an atomic register that supports both optimistically fast reads and writes?

Contribution: Theory. On the positive side, we present Gus, which implements an MWMR atomic register with optimal resilience and optimistically fast read and write operations when there are up to 5 nodes. To achieve optimistically fast operations, Gus combines two novel techniques: (i) speculative timestamp: a node optimistically uses its locally known logical timestamp to enable a 1-RTT fast path for writes (i.e., writes commit in a single communication step), and (ii) view exchange: nodes exchange newly received timestamps to enable a 1-RTT fast path for reads.
Considering that most production storage systems deploy 3- or 5-way geo-replication [16, 29, 49], we believe Gus is useful for practical settings, given its performance benefits. Furthermore, to address the scalability issue, we propose two solutions with different trade-offs between latency (in terms of RTTs) and resilience.
Furthermore, we formally prove that scalability is fundamentally limited. We show that when there are > 5 nodes, it is impossible to emulate an optimally resilient atomic register that supports optimistically fast reads and writes. This impossibility implies that Gus is optimal in this aspect.
Contribution: Systems and Experiments. We experimentally evaluate how the property of optimistically fast operations behaves in object storages, as it is difficult to quantify in theory how concurrent operations affect performance. Practical systems often use a consensus-based approach to implement an object storage. Hence, we compare Gus with the state-of-the-art consensus-based systems EPaxos [44], Gryff [12], Giza [14], and Tempo [21].
We implemented Gus in the framework of EPaxos [44] and Gryff [12] to make a fair comparison. Furthermore, in the same framework, we implemented our own version of Giza [14] (source code not available). Gus outperforms these competitors in both throughput and latency, which demonstrates practical performance benefits under a wide range of workloads. Under various settings with three nodes, Gus has better tail latency than both Gryff and EPaxos. Compared to Gryff, 5%-18% of Gus's reads are faster, and ≥95% of writes improve latency by up to 50%. Gus also has 0.5x-4.5x higher maximum throughput than both Gryff and EPaxos in the case of write-intensive geo-replication workloads. With 9 nodes, Gus's tail latency for reads has a 12.5% improvement over Tempo's [21].

PRELIMINARIES AND RELATED WORK

System Model
We consider an asynchronous message-passing network consisting of n nodes, where n ≤ 5. Section 5.1 presents solutions to scale Gus beyond 5 nodes with different trade-offs. At most f of the nodes may crash at any point in time. Gus ensures safety and liveness as long as n ≥ 2f + 1. Messages can be arbitrarily delayed, but messages between any pair of fault-free nodes are delivered eventually.
Following the convention of the literature [7, 8, 41], we assume that each node has client threads (reader thread or writer thread) and a server thread. In practical systems, this model captures co-located clients - a client c is co-located with a server s if the message delivery latency between c and s is much less than the minimum latency between c and the other servers. Clients running the applications (e.g., web hosting or backup service) can be considered co-located with a server in the same data center.
Linearizability. Gus achieves linearizability [32]. That is, there exists a total ordering π of operations such that (i) operations appear to execute in the order of π; (ii) π is consistent with register semantics, i.e., any read must return the value of the most recent write in π; and (iii) π satisfies the real-time ordering between operations, i.e., if operation o1 completes before the invocation of operation o2, then o2 should appear after o1 in π.
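Condition (iii) can be illustrated with a small sketch in Go (the language of our implementation); the operation names and the integer clock below are illustrative, not part of the protocol:

```go
package main

import "fmt"

// Op records the real-time interval of an operation, for checking
// condition (iii) above. Names and times here are illustrative.
type Op struct {
	Name   string
	Invoke int // invocation time
	Done   int // completion time
}

// respectsRealTime reports whether a total order respects real time:
// if an operation completes before another is invoked, it must
// precede that operation in the order.
func respectsRealTime(order []Op) bool {
	for i := range order {
		for j := range order {
			// order[j] precedes order[i] in real time but follows it in the order.
			if j > i && order[j].Done < order[i].Invoke {
				return false
			}
		}
	}
	return true
}

func main() {
	w1 := Op{"w1", 0, 2}
	r1 := Op{"r1", 3, 5} // r1 is invoked after w1 completed
	fmt.Println(respectsRealTime([]Op{w1, r1})) // true: w1 before r1
	fmt.Println(respectsRealTime([]Op{r1, w1})) // false: violates real-time order
}
```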

Related Work
This section discusses the closely related theory works. We defer the comparison between Gus and relevant practical systems to Section 6. These systems (e.g., [12, 14, 21, 44]) are based on some form of consensus and provide liveness only in partially synchronous networks, whereas Gus uses quorums and ensures both safety and liveness in asynchronous networks.
The ABD algorithm by Attiya, Bar-Noy, and Dolev [7] is the first implementation of an atomic single-writer multi-reader (SWMR) register in asynchronous networks with n ≥ 2f + 1. ABD requires 1 RTT for writes and 2 RTT for reads. Lynch and Shvartsman [42] later extend the algorithm to the multi-writer multi-reader version, which takes 2 RTT for writes and 2 RTT for reads. Both versions of ABD support a simple optimization to make reads optimistically fast, i.e., 1-RTT reads when there is no concurrent write.
Subsequent works [20, 23, 34] study fast operations, which complete in 1 RTT. The algorithm in [23] supports fast writes only when there are at most n − 1 writer clients. In practical geo-replication with n data centers, this condition implies that one data center cannot serve any writes. The algorithms in [20, 34] support fast reads, but require n = Ω(R), where R is the number of readers.
Prior works identify several impossibilities. Dutta et al. [20] show that in general it is impossible to have both fast writes and fast reads when a single node may crash. Englert et al. [23] prove that to support fast writes, the number of writers cannot be more than n − 1 (which implies that their algorithm is optimal in this aspect). Huang et al. [34] derive two more impossibilities: (i) fast writes are impossible if reads need to be completed in 2 RTT; and (ii) to have fast reads and 2-RTT writes, Ω(R) is a lower bound on n.
Several works study other variations of these properties, e.g., semi-fast operations [28, 37], fast operations for Byzantine-tolerant SWMR registers [31], weak semi-fast operations [27], and fast operations for regular and safe registers [3, 30]. To the best of our knowledge, no prior work studies the feasibility of atomic registers with optimistically fast operations. Furthermore, our idea of speculative timestamps is new, and may be useful for future works that aim to achieve optimistically fast operations.
ABD Register [7, 42]. Most works on atomic registers in message-passing networks are inspired by ABD, including Gus. Hence, we briefly describe ABD before presenting our design. We describe ABD and Gus for a single register. Recall that linearizability is a local (or composable) property [32], i.e., the property holds for a set of objects if and only if it holds for each individual object. Therefore, it is straightforward to compose instances of these protocols to obtain a linearizable system that supports multiple registers.
ABD associates a unique tag with each write and its value. Writes and values are ordered lexicographically by their tags. Formally, a tag is a tuple (ts, wid) consisting of two fields: (i) a logical timestamp ts representing the (logical) time of the write; and (ii) the writer ID wid representing the identifier of the writer client that invokes the write. For a tag t, we use "t.ts" to denote the timestamp field, and "t.wid" to denote the writer ID field. Two tags are compared lexicographically: tag t1 equals t2 if t1.ts = t2.ts and t1.wid = t2.wid, and t1 is larger than t2 if t1.ts > t2.ts, or t1.ts = t2.ts and t1.wid > t2.wid.
Each node stores a value v and an associated tag t. The ABD register requires two phases for both reads and writes. A read begins with the reader client obtaining the current tag and value from a quorum; a quorum is any simple majority of nodes. The reader then chooses the value associated with the maximum tag and propagates this maximum tag and value to all nodes. Upon acknowledgments from a quorum, the read is complete. The second phase, namely the "write-back" phase, can be omitted if all the tags from the first phase are identical, achieving optimistically fast reads.
A writer client w follows a similar two-phase procedure. It first obtains the maximum tag tmax from a quorum, and then constructs a new tag t = (tmax.ts + 1, w). In the second phase, client w propagates t and its value to all nodes and waits for acknowledgments from a quorum. Since a writer needs to contact a quorum to obtain tag t (the writer-reads design), ABD and relevant protocols [20, 34] require 2 RTT for write operations, even if there is no concurrent operation. Our "speculative timestamp" technique and the focus on only 3 or 5 nodes allow us to skip this step optimistically.
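As an illustration, the tag ordering and ABD's new-tag construction can be sketched in Go (the types and names below are ours, not from the ABD pseudocode):

```go
package main

import "fmt"

// Tag orders writes lexicographically: first by logical timestamp,
// then by writer ID to break ties.
type Tag struct {
	TS  int // logical timestamp
	WID int // writer ID
}

// Less reports whether t is ordered before u.
func (t Tag) Less(u Tag) bool {
	return t.TS < u.TS || (t.TS == u.TS && t.WID < u.WID)
}

// nextTag sketches ABD's first write phase: take the maximum tag
// reported by a quorum and bump its timestamp, tagged with this writer.
func nextTag(quorumTags []Tag, writerID int) Tag {
	max := quorumTags[0]
	for _, t := range quorumTags[1:] {
		if max.Less(t) {
			max = t
		}
	}
	return Tag{TS: max.TS + 1, WID: writerID}
}

func main() {
	tags := []Tag{{2, 1}, {3, 2}, {3, 1}} // tags reported by a quorum
	fmt.Println(nextTag(tags, 3))         // {4 3}: strictly larger than all reported tags
}
```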

GUS: DESIGN

Architecture and Protocol
Gus borrows tags and lexicographical ordering from ABD. A key challenge is to determine a tag for each write. Later we will show that even with a speculative timestamp, each write and its value still obtain a unique tag. As a result, we will often refer to a tag as the "version" of the register value.
Recall that each node has a writer, a reader, and a server. Writers and readers communicate with server threads at other nodes. For brevity, we simply say writers/readers communicate with nodes. Readers exchange ⟨read⟩ and ⟨ack-read⟩ messages, and writers exchange ⟨write⟩, ⟨ack-write⟩, and ⟨commit-write⟩ messages. Background handlers of the server implement a set of event-driven functions that exchange ⟨update-view⟩ messages with other nodes and update local variables.
Node States. Each node n_i maintains three states:
• tag_i: the largest tag known to n_i;
• S_i: a set of tuples (t, v), which stores all the versions of the register value, where each version has a unique tag t;
• View_i: records, for each node, the tags that node is known to have received; it is used to decide which version of the register value is safe to return, with respect to linearizability.
We assume any thread on the node can access these states. This assumption is typical in many practical systems, as clients are handled by client proxies that run on each node.
Techniques and Challenges. Gus has two novel techniques:
• Speculative timestamp: A writer opportunistically uses the local tag tag_i as the tag for the value it intends to write.
• View exchange: Each node propagates to all the other nodes whenever it has learned a new tag. Each node n_i uses View_i to keep track of this information.
Speculative timestamps allow Gus to achieve 1 RTT when there is no concurrent write, entering the second phase only when observing a concurrent write. View exchange allows nodes to collect up-to-date information and enables 1-RTT reads when there is no concurrent write. In terms of protocol design, we need to address the following two technical challenges:
• No read can return a stale value, even if the speculative timestamp is stale. A writer can observe a stale timestamp if the node that the writer is co-located with has not received the most recent writes from other nodes.
• No write operation can be associated with two tags. Essentially, the ordering of the operations is constructed using the tags; hence, if a write can be associated with multiple tags, the total ordering could be violated. We will formally define what "associated tags" means after presenting the protocol.
Protocol Specification. Algorithm 1 specifies the steps that need to be followed by each node when n = 3. We defer the discussion of the extension to n = 4 or 5 to Section 3.3.

Write operation:
Writer w, which is co-located with node n_i, obtains tag (ts, w) by adding 1 to the timestamp of the largest tag known to node n_i (Line 2). It then propagates the value along with this new tag to all the nodes and waits until receiving an acknowledgement from a quorum of nodes Q (Line 4). A quorum used in Gus is always a simple majority.
Fast Path: Writer w can then detect whether there is a concurrent write by comparing (ts, w) with the tags received from Q (Line 6). If there is no concurrent write operation, then w's write is on the fast path (Line 7). The client is notified that the write is completed at this point. The writer proceeds to asynchronous bookkeeping steps, including updating the tag (Line 12), storage (Line 13), and view (Line 14). All these steps can be done asynchronously, because after Line 7, it is guaranteed that enough nodes have already obtained the value with the correct tag.
Slow Path: Only if the writer detects a concurrent write does it need to obtain and update the correct logical timestamp. It first constructs the logical timestamp by finding the largest timestamp field in the tags received from Q and increasing it by 1 (Line 9). The writer then sends the commit message ⟨commit-write⟩ to all the nodes to update the tag, and waits for acknowledgements from a quorum on the slow path. This is necessary to ensure that enough nodes have received the correct and updated tag. Note that the ⟨commit-write⟩ message does not include the value field, to save network bandwidth.
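The writer's fast/slow-path choice can be sketched as follows. This is a simplified model, assuming each ⟨ack-write⟩ carries the responder's highest known tag; names are illustrative:

```go
package main

import "fmt"

type Tag struct {
	TS  int
	WID int
}

func (t Tag) Less(u Tag) bool {
	return t.TS < u.TS || (t.TS == u.TS && t.WID < u.WID)
}

// decideWrite mirrors the fast/slow-path choice described above.
// spec is the writer's speculative tag (local timestamp + 1); ackTags
// are the highest tags reported by a quorum in <ack-write> messages.
// It returns the final tag and whether the fast path was taken.
func decideWrite(spec Tag, ackTags []Tag) (Tag, bool) {
	maxTS := spec.TS
	concurrent := false
	for _, t := range ackTags {
		if spec.Less(t) {
			concurrent = true // a concurrent write with a larger tag exists
		}
		if t.TS > maxTS {
			maxTS = t.TS
		}
	}
	if !concurrent {
		return spec, true // fast path: the write commits in 1 RTT
	}
	// Slow path: pick a timestamp larger than everything seen; a
	// <commit-write> round (not shown) then installs it at a quorum.
	return Tag{TS: maxTS + 1, WID: spec.WID}, false
}

func main() {
	spec := Tag{TS: 2, WID: 1}
	tag, fast := decideWrite(spec, []Tag{{1, 2}, {2, 1}})
	fmt.Println(tag, fast) // no larger tag seen: fast path, speculative tag kept

	tag, fast = decideWrite(spec, []Tag{{2, 3}, {1, 2}})
	fmt.Println(tag, fast) // tag (2,3) > (2,1): slow path with timestamp 3
}
```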

Background Handler for Writes:
The server thread has event-driven handlers that run in the background to process incoming messages. Upon receiving a ⟨commit-write⟩ message from writer w, node n_i moves the value to the correct tag in its storage S_i (Lines 32, 33) if the write has been stored in S_i before. Otherwise, n_i updates S_i to make sure that w's write has the correct tag. The tag in S_i could be stale if both n_i and w have not observed a previously completed write operation. Next, n_i proceeds with steps similar to the previous handler: it updates the tag (Lines 36, 37) and view (Line 38), and notifies others that it has learned a new value (Line 39). Finally, n_i sends an acknowledgement to w (Line 40).
Technical Challenge 1: Due to asynchrony and failures, it is possible for a write to have a stale speculative timestamp. Consider the example in Figure 1: node n_3 has not observed the most recent write w_1; hence, its timestamp is still 1. Then, w_2, invoked by a writer co-located with n_3, has a stale tag because its speculative timestamp is less than the one included in the completed write operation w_1. Recall that to satisfy linearizability, a read that occurs after w_2 has to return the value of w_2, instead of that of w_1. Gus achieves this by introducing the second phase to identify and update the correct timestamp, which equals 3 in this example. After w_1 completes, n_1 and n_2 have timestamp 2; hence, after the first phase, w_2 learns the most recent timestamp from either node, and updates to the correct tag in the second phase.

Read operation:
In Gus, a reader can retrieve the value from its co-located node. The only task is to figure out the value associated with the most recent tag, i.e., the version of the value that satisfies Condition SafeToRead. Intuitively, Condition SafeToRead finds a (t, v) pair in S_i whose t is at least as large as tag_i and whose v is received by a quorum of nodes Q_r, including n_i. In other words, the condition ensures that the returned value is received by a read quorum Q_r, and its version t is at least as recent as tag_i.
Figure 2 presents the fast and slow paths for reads. If tag_i = t_1, then the condition must hold at that moment. If there is a concurrent write (with a larger tag), then tag_i > t_1. Thus, the reader at n_1 needs to wait for more messages - a ⟨write⟩ from n_5 and ⟨update-view⟩ messages from two other nodes - to satisfy Condition SafeToRead. In the worst case, this takes 2 RTT.
Technical Challenge 2: With speculative timestamps, a write may have two tags (or timestamps). We say that a write is "associated" with a tag t (or timestamp ts) if a read returns the value of the write with t (or ts). In the example of Figure 1, we do not want w_2 to be associated with timestamp 1, i.e., no read should return the value of w_2 with timestamp 1. This is because eventually w_2 will update its timestamp to 3, which would mean w_2 is associated with two tags. Consequently, it would be impossible to find a total ordering using associated tags that satisfies linearizability.
Condition SafeToRead is devised so that such an undesirable scenario can never occur. In Gus, a read returns v only if a read quorum Q_r has received (t, v). In the aforementioned example, no read can return a value associated with timestamp 1, because n_1 and n_2 observe that timestamp 1 is stale, and n_3 updates its tag only after w_2 is completed; hence, it is not possible to gather a read quorum. When n = 3, if a write observes a stale tag t, then no read can return a value with t. This is because at most one other node would consider t as the most recent tag, which means that no read can obtain t from a read quorum Q_r.
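The quorum check behind Condition SafeToRead can be sketched as follows, assuming (as a modeling choice, not the paper's exact data structure) that View_i is a per-node set of received tags:

```go
package main

import "fmt"

type Tag struct {
	TS  int
	WID int
}

func (t Tag) Less(u Tag) bool {
	return t.TS < u.TS || (t.TS == u.TS && t.WID < u.WID)
}

// safeToRead mirrors Condition SafeToRead: among the versions in storage,
// find one whose tag is at least tagLocal and that a read quorum of nodes
// (including this one) is known to have received. view[node] is the set of
// tags that node is known to hold; names and shapes here are illustrative.
func safeToRead(storage map[Tag]string, tagLocal Tag, view []map[Tag]bool, self, quorum int) (Tag, string, bool) {
	for t, v := range storage {
		if t.Less(tagLocal) {
			continue // stale version
		}
		if !view[self][t] {
			continue // the read quorum must include this node
		}
		count := 0
		for _, holds := range view {
			if holds[t] {
				count++
			}
		}
		if count >= quorum {
			return t, v, true
		}
	}
	return Tag{}, "", false
}

func main() {
	t1 := Tag{2, 1}
	storage := map[Tag]string{t1: "x"}
	// n = 3: nodes 0 and 1 hold t1, which forms a read quorum of size 2.
	view := []map[Tag]bool{{t1: true}, {t1: true}, {}}
	_, v, ok := safeToRead(storage, t1, view, 0, 2)
	fmt.Println(v, ok) // "x true": t1 is held by a quorum including node 0
}
```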

Correctness and Performance Analysis
We follow the proof structure in [7, 42], i.e., using tags to assign the order of the operations. The key difference is to prove that Gus addresses Technical Challenge 2 correctly - each write can only be associated with one tag. We prove the claim by formalizing the argument in Section 3.1. The complete proof is presented in Appendix A.
Gus achieves optimistically fast operations, i.e., both writes and reads take 1 RTT if there is no concurrent write. Both operations take 2 RTT in the worst case, as shown in Figure 2. Message complexity for reads is the same as in prior algorithms [7, 42], O(n). For reads, we only count the messages on the fast path, since, as shown in Figure 2, the other messages for committing reads belong to writes. For writes, the message complexity is O(n^2) due to ⟨update-view⟩. Despite the higher complexity, we find this acceptable in our target case because this design enables the fast path for reads. Moreover, in the case of object storage systems, ⟨update-view⟩ only contains the tag, not the data itself. Since typical data size is in the range of KBs, MBs, or more [4, 14, 24], the bit complexity and network bandwidth consumption of this overhead are negligible.

The Case of 𝑛 = 4 or 5
Algorithm 1 does not work for n > 3 owing to Technical Challenge 2 - a write could be associated with two tags when n > 3. Consider the example in Figure 3. Suppose w_1 is from a writer at n_1 and w_2 is from a writer at n_2. Writer w_1 learns from n_2 that its speculative timestamp is stale due to the concurrent w_2. In the meantime, n_3, n_4, and n_5 have not observed w_2 and form a read quorum, which allows a reader to read w_1 with its speculative tag. To address this issue, more information needs to be included in the ⟨ack-write⟩ message - if the highest tag is from a concurrent write, then the responding node needs to indicate whether that write has completed or not. In the earlier example, the second phase is then not needed: since w_2 has not completed yet (i.e., w_2 has not received a confirmation from a quorum), the writer w_1 does not need to update its tag, and can complete its write on the fast path. This does not violate linearizability, since by definition, two concurrent writes can appear in any order.

IMPOSSIBILITY
Theorem 1. For n > 5 and n = 2f + 1, it is impossible to have an atomic register that supports optimistically fast writes and reads.
Proof Sketch. The proof is based on an indistinguishability argument, which constructs several executions indistinguishable to nodes such that in one of the executions, a reader has to return a value that violates linearizability. All the executions we construct have no concurrent write; hence, optimistically fast operations require all operations to complete in 1 RTT.
Consider n = 7 with nodes n_1 to n_7. Since f = 3, the maximum quorum size to ensure liveness is 4. Now, consider executions in which the first write w_1 is invoked by a writer and writes value v.

Fundamentally, the impossibility arises because the quorum intersection is too small for a larger n. Due to the 1-RTT communication, readers and writers are not able to update all nodes in the read or write quorums. This is why we can defer messages in the proof. In the case of ABD, the read quorum of r_1 will learn the most recent value before r_1 completes because of the write-back. Consequently, the read quorum for r_2 would return v.
Note that in the construction above, we require 3 nodes to fail. This is why Gus works for n ≤ 5. For example, when n = 3, the union of the reader and the quorum intersection is enough to ensure safety; hence, E3 is impossible and readers can learn the correct value that satisfies linearizability. To circumvent the impossibility, we either need to sacrifice optimistically fast operations or resilience.

PRACTICAL CONSIDERATIONS

Scalability
To increase scalability, we present two solutions for n > 5. The first adds 1 RTT to writes and is suitable for serving larger objects because of its natural integration with erasure coding. The second increases the quorum size by focusing on the case of a smaller number of concurrent failures (relaxed resilience), a common case for modern geo-replicated systems [16, 21, 22]. For n ≤ 5, these solutions are not needed, and therefore not applied.
Layered design by separating metadata and data: Inspired by Giza [14] and Layered Data Replication [25], we integrate a layered design with Gus, which separates the data and metadata paths into two layers. To read, a client first contacts the servers in the metadata layer to find the set of data servers that have the most recent data, and then reads the data from any of them. To write, a client first writes its data to a set of data servers, and then updates the metadata servers. Such a layered design allows the underlying data/metadata servers to optimize for different workloads and features. Giza uses Azure object storage as the data server and Azure table as the core of the metadata layer. Giza adopts Fast Paxos [39] to replicate the metadata (i.e., the version, the IDs, and the locations of the object) to 3 or 5 metadata servers, whereas Gus uses Algorithm 1. In our implementation, we use Redis as the data server because of its high performance and support for durability.
As observed in [14, 52, 55], to save storage and network cost, it is common to use erasure coding for the data layer. For larger objects, we adopt an n = k + r Reed-Solomon code [35] - the value is divided into k data fragments, and the encoder generates r parity fragments. Each data server stores exactly one fragment. The object is durable as long as at most r nodes fail. With erasure coding, both writes and reads take 2 RTT on the fast path.
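For intuition, the storage saving of the k + r split can be computed as follows (the sizes and the function name are illustrative; a 120 MB object with k = 3 and r = 2):

```go
package main

import "fmt"

// storagePerObject compares the storage cost of k+r Reed-Solomon
// coding with full replication across the same n = k + r data
// servers. Each server stores one fragment of size objectSize/k.
func storagePerObject(objectSize, k, r int) (coded, replicated int) {
	n := k + r
	coded = n * (objectSize / k) // n fragments, each objectSize/k
	replicated = n * objectSize  // n full copies
	return
}

func main() {
	// Illustrative: a 120 MB object, k = 3 data and r = 2 parity fragments.
	coded, repl := storagePerObject(120, 3, 2)
	fmt.Printf("coded: %d MB, replicated: %d MB\n", coded, repl)
	// → coded: 200 MB, replicated: 600 MB
}
```

With these numbers, coding stores 200 MB in total versus 600 MB for 5-way replication, while still tolerating r = 2 failed data servers.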
Increasing quorum sizes by lowering resilience: Concurrent failures in replication across datacenters are rare and transient [16, 21, 22]; hence, it is reasonable to focus on a smaller f with larger quorums. Let q_r and q_w be the sizes of the read and write quorums, respectively. As long as they satisfy the following inequalities, Gus ensures safety:

2q_w > n and 2n − 2q_w − 1 < q_r

The first inequality ensures that any two write quorums intersect, whereas the second ensures that any write can only be associated with one tag (which can be argued similarly as before). As long as a writer (or reader) can reach a write (or read) quorum, its operation can be completed.
For read-intensive workloads, we can let q_r be ⌊n/2⌋ + 1; the inequalities then require q_w ≥ n − ⌊q_r/2⌋. For write-intensive workloads, we can lower the write quorum size by increasing the read quorum size accordingly. In other words, tolerating fewer failures allows Gus to explore a trade-off between quorum sizes and the performance of different operations.
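The two inequalities above can be checked mechanically; for instance, with n = 11, the pair q_r = 6, q_w = 8 satisfies both (a sketch; the function name is ours):

```go
package main

import "fmt"

// safeQuorums checks the two safety conditions from Section 5.1:
// write quorums must pairwise intersect (2*qw > n), and
// 2n - 2*qw - 1 < qr, so that a write is associated with a single tag.
func safeQuorums(n, qr, qw int) bool {
	return 2*qw > n && 2*n-2*qw-1 < qr
}

func main() {
	// n = 11 with a majority read quorum: qr = 6 and qw = 8 work.
	fmt.Println(safeQuorums(11, 6, 8)) // true
	fmt.Println(safeQuorums(11, 6, 7)) // false: 2n - 2*qw - 1 = 7 is not < 6
}
```

Note that n = 3 with simple majorities (q_r = q_w = 2) also passes the check, matching the base protocol.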

Optimizing Reads in Gus
We have two approaches to optimize reads in Gus. Consider the case of n = 3. Gus's read needs only 1 RTT with one simple change - piggybacking the value associated with the highest tag in ⟨ack-read⟩ at Line 42. Since any two nodes form a read quorum Q_r, upon receiving the value associated with the highest tag, the reader can update S_i and directly return the value, which must satisfy Condition SafeToRead. The second optimization applies when n ≤ 5 and a node serves several reader clients (a typical case in practical systems). Observe that a read does not change the state at other nodes; hence, when there are multiple concurrent readers co-located in the same data center, all the subsequent reads can "tag along" with the first read without sending any messages.

EVALUATION
We evaluate Gus in practical settings. Our evaluation focuses on the case of geo-replicated object storages, because (i) atomic registers capture their semantics [12, 14]; and (ii) round-trip time matters the most for user-perceived latency in the case of geo-replication, as cross-datacenter latency can be on the order of 100+ ms.
As discussed in Section 2.2, prior algorithms [20, 23, 34] with fast operations have limited practical usage due to their stringent conditions. Therefore, we compare Gus with consensus-based systems. Even though these systems only ensure liveness when the network is partially synchronous, they have high performance in the common case. We first present related systems that are optimized for geo-replication, followed by our evaluation.

Related Work: Geo-replicated Systems
A comparison of Gus's features against state-of-the-art competitors is outlined in Table 1. To ensure a total ordering, storage systems often adopt a consensus-based approach. Most production systems [1, 2, 9, 11, 13, 16] rely on variants of Paxos [38, 40] or Raft [46] to agree on the order of client commands (or requests) and execute the commands following the agreed order. Unfortunately, these leader-based consensus protocols suffer from long latency - 2 RTT (cross-datacenter message delay) - if the clients are not co-located with the leader's data center.

Many recent systems [5, 21, 22, 43, 44, 47] propose a leaderless design to avoid the bottleneck at the leader and achieve optimistically fast operations. EPaxos commits commands in 1 RTT when there is no contention, and 2 RTT with contention. Unfortunately, EPaxos has worse tail latency than Paxos-based systems (up to 4x worse) [12] and may livelock in pathological cases [50]. This is mainly because EPaxos's fine-grained dependency tracking may chain dependencies recursively, and the execution of some operations may be delayed in wide-area networks [50].
Gryff [12] reduces tail latency by unifying consensus and shared registers. Gryff implements an abstraction that provides read, write, and read-modify-write (RMW) operations on a single object. At a high level, it uses the ABD register [7] to process reads and writes, and EPaxos [44] to process RMWs. While Gryff reduces p99 read latency compared to EPaxos, it always takes 2 RTT to complete a write; hence, it does not achieve optimistically fast writes and is not suitable for write-intensive workloads like game hosting or enterprise backup services, which typically have around 90% writes [4].
Giza uses Fast Paxos [39] to agree on the version for each operation, and needs only 1 RTT when there is no concurrent write. Two downsides of Giza are its reliance on a coordinator to order concurrent write operations and the fact that its fast quorum requires a super majority. Both affect tail latency, especially for geo-replicated storage systems, because clients need to wait for the nodes or the coordinator in farther datacenters.
Atlas [22] and Tempo [21] are two recent consensus-based systems that sacrifice resilience to optimize performance. Atlas uses dependency tracking and hence suffers from long tail latency. Tempo develops a novel mechanism of using (logical) timestamps to determine when it is safe to execute a particular operation. Both systems have quorum size ⌊n/2⌋ + f, which is optimal when f = 1. Atlas and Tempo do not distinguish between read and write quorums. Compared to them, Gus can be configured to have an optimal read quorum size, while having a write quorum size that is the same or greater by 1. Table 2 presents some examples. Gus's smaller read quorum not only allows better read latency, but also ensures that reads can still complete when only ⌊n/2⌋ + 1 nodes are alive. For the case of n = 11, Tempo requires a quorum of 8, which equals the write quorum of Gus, whereas reads can be served with a quorum of 6 in Gus. Later, in Section 6.8, we will see how a smaller quorum size allows Gus to have better tail latency under practical workloads.

Other consensus-based systems achieve optimistically fast operations for both reads and writes, e.g., M2Paxos [47], Caesar [5], and Mencius [43]. Each system performs well in certain cases. To support more general operations, e.g., transactions or RMWs, they sacrifice performance under highly skewed workloads. Both EPaxos and Caesar use dependency tracking, which leads to high tail latency [50]. M2Paxos requires a lock on an object and hence is not suitable for workloads with high contention. Mencius needs information from all nodes.

Implementation and Experiment Setup
In our evaluation, we focus on tail latency, because it is well known that user-perceived latency is correlated with the tail latency of the underlying storage systems [6, 18, 45, 48]. We evaluate Gus against two categories of competitors: (i) those aiming at or optimizing for fault-tolerant non-blocking MWMR registers (Gryff and Giza), and (ii) state-of-the-art consensus systems (EPaxos and Tempo) that are optimized for the scenarios that Gus targets.
For Gryff, we are essentially evaluating its ABD component (with Gryff's optimizations), as the workload consists of only reads and writes. For Giza, we only focus on the tail latency without any concurrent writes; as documented in [14], its design is not optimized for concurrency. For scalability, we compare Gus with Tempo configured so that they tolerate the same number of concurrent failures.
Recent systems [26,52,53] use techniques such as coding and nil-externality to further improve performance. We do not compare against them due to their leader-based design. We mainly focus on leaderless systems because, as demonstrated in [12,14,44], leaderless systems have better performance in both the common case and tail latency in the context of geo-replicated storage.
Implementation. We implemented Gus and our version of Giza (whose source code is not available) in Go using the framework of EPaxos and Gryff to ensure a fair comparison between protocols. For Tempo, we use the implementation in [21].

Table 3: RTT (in ms) between VMs in emulated geographic regions [12]. For n = 3, we use VMs in CA, VA, and IR.
Even though Algorithm 1 focuses on a single register (or object) for clarity, our implementation supports multiple objects and adopts the optimizations mentioned in Section 5. To do so, we include two extra fields in each message type: a key and a sequence number. The key denotes the identifier of each object, and the sequence number is the operation index. This allows Gus to support multiple objects and also pipelining. We do not enable the thrift optimization or batching, because these optimizations generally increase tail latency by increasing the chance of conflicts [12,44,50].
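A minimal sketch of such a message header is below. The struct and field names are ours, not from the Gus codebase; it only illustrates the two extra fields described above, with a tag field added for concreteness.

```go
package main

import "fmt"

// MsgHeader carries the two extra fields added to every message type:
// the object key and the per-object operation index used for pipelining.
type MsgHeader struct {
	Key string // identifier of the object the operation targets
	Seq uint64 // operation index, enabling pipelined operations per object
}

// WriteMsg is a hypothetical write message embedding the header.
type WriteMsg struct {
	MsgHeader
	Tag   [2]uint64 // (timestamp, writer ID) pair
	Value []byte
}

func main() {
	m := WriteMsg{MsgHeader{"user:42", 7}, [2]uint64{3, 1}, []byte("v")}
	fmt.Println(m.Key, m.Seq) // user:42 7
}
```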
In addition, we follow the same setup as in [12,44,50] and separate node and client machines for best performance. Each node has several client proxies that handle requests from the respective clients.
Testbed. We run our experiments on CloudLab [19] using m510 machines (Intel Xeon D-1548, 8 cores, 6GB RAM) for node VMs and c6525-25g (AMD EPYC 7302P, 16 cores, 8GB RAM) for the client VM. We adopt the same latency profile used in [12]: (i) n = 3: nodes in California (CA), Virginia (VA), and Ireland (IR); and (ii) n = 5: two more nodes in Oregon (OR) and Japan (JP). The latencies of the wide-area network are emulated using Linux's Traffic Control (tc) by adding delays to packets on all nodes with filters on different IPs. Table 3 shows the configured RTT between nodes in different regions. These numbers were chosen to represent typical RTTs between the corresponding Amazon EC2 availability regions [12].
Experiment Setup. In all experiments except for the ones in Section 6.6, the clients run on one client VM, which has no artificial latency to any of the node VMs. In experiments testing the integration of the layered design (Section 6.6), the clients are located in the CA region. For most experiments, we use 16 closed-loop clients co-located with each node, again following [12]. This setup balances capturing the effect of concurrent operations against saturating the system, and also isolates the measurements from hardware and software limitations. We use different numbers of clients to stress the systems in the throughput experiment.
In our implementation, although clients and servers are co-located, clients do not interact directly with the system's backend; they pass through a proxy interface, which emulates an intermediate tier typically deployed in data centers for security and access control. In other words, the backend in our implementation supports remote clients through proxies.
For all experiments, we require the system to commit and execute client requests before responding to clients, since, as observed in [50], most applications depend on information or confirmation returned by an operation. For example, Redis and ZooKeeper [36] return results for both reads and writes.
Each experiment is run for 180 seconds, and we collect statistics during the middle 150 seconds. In our experience, the statistics are quite stable during this period because the warmup and cool-down phases are excluded. By default, each object is 16B. While large objects are common in object storage [14,24], 16B gives the best performance for EPaxos and Gryff, and in [6], Facebook reported on a workload using Memcached as a key-value store in which 40% of the data is less than 11B. Therefore, we mainly test 16B objects.
Following prior works [12,44,50], all the systems we test store the data in main memory, except for two sets of experiments. This choice is reliable as long as the number of concurrent node failures is bounded. We target redundancy across data centers, which very rarely fail concurrently [14,16,22]. Moreover, there exist solutions that prevent data loss from crashed machines, e.g., persistent memory or disaggregated memory [17,51,54,56]. For persistence, we evaluate two approaches: (i) writing to disk using Redis in Section 6.6, and (ii) writing to a file using Go's os package in Section 6.7.
Two operations are said to conflict with each other if they target the same object (or the same key) [12,14,44,50]. Following the evaluation in [12,21,22,44], we focus on evaluations with various conflict rates. A conflict rate c denotes that a client chooses the same key with probability c, and some unique key otherwise. A workload with a Zipfian distribution [50] shows a similar pattern.

Summary of Our Findings
To understand whether Gus performs well under various settings, we aim to answer the following questions:
• Does Gus reduce tail latency under various conflict rates and write ratios? (Section 6.4)
• How does the throughput of Gus compare to the state-of-the-art competitors? (Section 6.5)
• How do object size and write ratio affect the performance of Gus? (Section 6.5 and Section 6.6)
• How does Gus perform when integrated with the layered design and erasure coding? (Section 6.6)
• How does persistence affect latency? (Section 6.7)
• How does Gus scale when tolerating a smaller number of concurrent failures? (Section 6.8)

To summarize our findings: under various conflict rates, Gus has better read and write tail latency than both Gryff and EPaxos. When n = 3, around 5%-18% of Gus's reads are faster than Gryff's, even though both systems complete reads in 1 RTT. This demonstrates the effectiveness of our read optimization mentioned in Section 5. Gus's maximum throughput is 0.5x-4.5x greater than that of Gryff and EPaxos in the context of geo-replication with a write-intensive workload. The write ratio does not have a significant impact on throughput in Gus, whereas it impacts Gryff significantly because of its 2-RTT writes. Finally, Gus's reads are 12.5%-17% faster than Tempo's because of the smaller read quorum size.

Tail Latency
The Case of n = 3. First, we examine the tail latency of Gus, with a focus on large-scale web hosting. Since web hosting applications are usually read-heavy [6,10,15], we use a ratio of 94.5% read operations with various conflict rates. This write ratio is the same as in the YCSB-B workload [15]. Figure 4 presents the cumulative distribution functions (CDF) of both read and write latency for clients from three regions (CA, VA, IR) under three different conflict rates with n = 3. The top row presents the CDFs for reads and the bottom row for writes.
1-RTT Reads for Gus and Gryff. Both Gus and Gryff complete reads in 1 RTT, as shown in the top row of Figure 4. ∼66% of reads complete after 1 RTT of communication with the nearest quorum (a simple majority) between CA and VA, which has a latency of 72ms.
Clients in IR are closest to the nodes in IR and VA, so 33% of the reads complete in 1 RTT between IR and VA, which is 88ms.
Read Optimization of Gus.As mentioned in Section 5, Gus exploits the semantics of linearizable object storages to return reads without any communication when there are concurrent readers co-located within the same data center.Depending on when the concurrent reads are invoked, the latencies vary from 0.755ms to 72ms for clients in CA and VA.
Impact of Instant Execution. As identified in [12,50], in EPaxos, some operations need to be delayed due to its dependency tracking, which results in higher latency. In comparison, Gryff and Gus can execute an operation instantly.
Impact of Conflict Rate. For both Gryff and Gus, the conflict rate does not affect latency significantly. This is because reads complete in 1 RTT, and writes always complete in 2 RTT in Gryff. With a higher conflict rate, Gus's reads have improved latency in the common case, owing to the read optimization: a higher conflict rate implies a higher chance for reads to tag along. With 25% conflicts, Gus's writes occasionally need 2 RTT to complete.
The Case of n = 5. Figure 5 reports the cumulative distribution functions of the latency of reads and writes with n = 5. In this experiment, we use a workload consisting of 49.5% reads and 50.5% writes with 25% conflicts. The write ratio follows YCSB-A.
The roughly equal share of reads and writes and the higher conflict rate allow us to observe the performance under concurrent operations. Again, having more concurrent operations allows Gus to complete reads faster. Its writes are also faster than Gryff's because of Gus's optimistically fast writes. A faster write also reduces the chance of concurrent access, which is why, compared to Gryff, more reads in Gus can be completed on the fast path. These results indicate that even with a write-heavy workload, Gus has better latency than both EPaxos and Gryff. Gus's reads are faster than those of competitors due to the read optimization under high conflict rate and write ratio.

Throughput vs. Write Ratio in WAN
We also measure the maximum attainable throughput with various write ratios and n = 3 (Figure 6). Gus maintains the highest throughput regardless of the write ratio. Gryff slows down at high write ratios because of its 2-RTT writes. EPaxos has the lowest throughput due to its dependency tracking.

Scalability: Layered Design
In this experiment, we test the layered design (Section 5), which deploys three metadata servers in the CA, VA, and IR regions, and nine data servers in the five regions identified in Table 3 plus four other regions not shown in the table for brevity. The maximum RTT between a pair of regions is 243ms, and the lowest is 7ms. To see the impact of the two-layered design and isolate interference from concurrency, we deploy one closed-loop client in CA and report the p99.99 latency in Figure 7 with a varying number of data servers. Giza [14] was optimized for workloads without contention, which aligns with our setup. We persist the data and commands to disk using Redis's append-only file feature with a single thread (we invoke fsync at every query).
The number of data servers equals the replication factor, since each server has one copy of data.Reads are scalable when there is no contention, since they take 1 RTT in contacting the metadata servers and retrieving data from the data servers co-located in the same data center.Writes take 2 RTT in this case.Gus outperforms Giza because of its smaller fast quorum.
Erasure Coding. We integrate erasure coding and the layered design in Gus with three data centers (located in CA, VA, and IR). Figure 8 reports p50 and p99 read latency. Erasure coding is more cost-effective for larger data in terms of the tradeoff between latency and savings in network bandwidth and storage; hence, we vary the data size from 4KB to 4MB. Under the no-contention workload, both systems take 2 RTT (1 RTT to the metadata server and 1 RTT to the data server). Gus has better latency due to its smaller fast quorum: its write takes roughly 144ms (2*72), whereas Giza's write takes roughly 223ms (72+151). We observe a similar pattern for write latency.

Persistence After Crash
For persistence, we log state changes to an SSD disk before sending acknowledgements. This experiment uses the same configuration as in Figure 4c. In EPaxos, nodes log synchronously to an SSD-backed file, whereas in Gryff and Gus, nodes log their state changes only for incoming writes; hence, we only report the latency for writes in Figure 9. All the systems are I/O bound, but EPaxos's dependency tracking makes its tail latency increase by ∼600ms, whereas Gryff's and Gus's increase by 280-300ms. Even with persistent writes, Gus still has better tail latency because of its 1-RTT fast path.
Gus has better tail latency for reads because of its smaller read quorum (see Table 2). For example, when n = 5, Gus's fast path to the closest fast quorum takes 72-145ms, while Tempo's takes 93-162ms. In general, Tempo has better latency for writes when n = 9, because its quorum is 1 less than Gus's write quorum. Occasionally, Tempo needs to wait for timestamps to become stable before executing an operation. This is the main reason that Gus outperforms Tempo when we consider p99 or higher latency for writes.
A CORRECTNESS PROOF

Definition 3 (Effective Operations). A read operation is called effective if the node invoking the read does not crash while executing it. A write operation is called effective if either the node invoking the write does not crash, or its value was returned by an effective read.
Definition 4 (Committed Write). A write is committed if a majority of nodes have its value stored locally.
An effective write must be committed, due to the write protocol's use of commit-write messages.
Lemma 1.Only the value from a committed write can be returned by a read operation.
Proof. This follows from the definition of a committed write and the design that a read operation requires a quorum to have received the same value-tag pair. □

Definition 5 (Tag for Committed Write). The tag for a committed write operation is the tag (ts, id) associated with the value it writes, where ts is defined at Line 3 for writes that take the fast path and Line 10 for writes that take the slow path.

Lemma 2. The tag of any committed write is unique.
Proof. This is because of quorum intersection. While the writer node is still completing a write operation w, only two other nodes may serve a read operation. Either both of these nodes see no conflict (so the tag is not updated), or one of the nodes sees a conflict, in which case a read does not return the value of w. □

Definition 6 (Tag for Write). The tag for a write operation is the tag (ts, id) associated with the value it returns at Line 17.
Proof. The first claim follows from quorum intersection; the second follows from Lines 2 and 10. □

Lemma 5 (Safety). There is a total order on all the effective operations such that (i) it respects the real-time occurrence order of the effective operations; and (ii) any effective read operation obtains the value written by the last write operation that precedes it in the order.
Proof. Consider only the effective operations. Define the total order as follows: order effective operations according to their tags.
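The tag order the proof sorts by can be sketched as the usual lexicographic order on (timestamp, writer ID) pairs used by ABD-style registers: a higher timestamp wins, and ties are broken by writer ID. This is our own sketch of that comparison, not code from Gus.

```go
package main

import "fmt"

// Tag is a (timestamp, writer ID) pair.
type Tag struct{ TS, ID uint64 }

// Less orders tags lexicographically: higher timestamp wins,
// with ties broken by writer ID.
func (a Tag) Less(b Tag) bool {
	if a.TS != b.TS {
		return a.TS < b.TS
	}
	return a.ID < b.ID
}

func main() {
	fmt.Println(Tag{3, 2}.Less(Tag{4, 1})) // true: higher timestamp wins
	fmt.Println(Tag{3, 2}.Less(Tag{3, 1})) // false: tie broken by writer ID
}
```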

(Algorithm 1, Lines 39-44:)
39: send ⟨update-view⟩ to all nodes
40: send ⟨ack-commit⟩ to the writer
41: Upon receiving ⟨read⟩ from a reader:
42:   send ⟨ack-read⟩ to the reader
43: Upon receiving ⟨update-view, (ts, id)⟩ from a node:
44:   add (ts, id) to that node's view entry

Upon receiving a ⟨write⟩ message from the writer, a node first checks the incoming tag (ts, id). If it is larger than the node's own tag, the node stores the value (Line 23), updates its tag (Line 24) and view (Line 25), and notifies the other nodes that it has learned a new value (Line 26). Finally, it replies to the writer with an acknowledgement (Line 29). If the node's own tag is larger, the writer's tag may be stale, and the writer needs to update its speculative timestamp later. Hence, the node puts the value in temporary storage (Line 28) and replies to the writer (Line 29).

Figure 1: Speculative Timestamp with n = 3. The two ends denote the invocation and the completion time of each operation, respectively. The orange box denotes the timestamp (ts) field of each write's tag. w2's timestamp was stale initially.

A read must satisfy both real-time and total ordering constraints. Gus achieves this by first contacting a quorum of nodes to learn their most recent tags (Lines 16-18), and using Condition SafeToRead, as per Definition 2, to obtain the return value (Lines 19-20).

Figure 2: Fast/Slow Path of Reads. Green arrows represent ⟨read⟩ and ⟨ack-read⟩, blue represent ⟨write⟩, and red represent ⟨update-view⟩. (On the right figure, not all messages are shown for brevity.)

Figure 5: Latency CDF (n = 5, 50.5% writes, 25% conflicts). Gus's reads are faster than those of competitors due to the read optimization under high conflict rate and write ratio.
• tag represents the largest known tag associated with the stored value; and
• view is a vector that keeps track of each node's view. The view of a node is defined as the set of tags that the node has known so far. By design, the view entry for a node contains the tags associated with all the values that node has stored.

Condition SafeToReturn, presented later in Definition 2, shows how Gus uses the view to decide whether a value is safe to return. (Nodes can support multiple writers and readers using proxies.)
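The per-node state described above can be sketched as a struct. Field names are ours (the paper's symbols were lost in extraction): Val holds the stored value, Tag the largest known tag for it, and View[j] the set of tags node j is known to have observed.

```go
package main

import "fmt"

// Tag is a (timestamp, writer ID) pair.
type Tag struct{ TS, ID uint64 }

// NodeState sketches the register state kept at each node.
type NodeState struct {
	Val  []byte        // currently stored value
	Tag  Tag           // largest known tag associated with Val
	View map[int][]Tag // View[j] = tags node j is known to have seen
}

func main() {
	s := NodeState{Val: []byte("v"), Tag: Tag{1, 0}, View: map[int][]Tag{}}
	// By design, a node's own view entry covers the tag of its stored value.
	s.View[0] = append(s.View[0], s.Tag)
	fmt.Println(len(s.View[0])) // 1
}
```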
completed with a write quorum of four nodes. All the messages from the write quorum to the other nodes are delayed, except for the messages between two particular quorum nodes. At some time t after w1 completes, a reader invokes a read r1 that completes with a read quorum of four nodes and returns w1's value v.
• E2: Only one node receives w1, because the writer's node and its writer client crash during the write. The messages from the writer's node are all lost, because it has crashed, and the messages from the receiving node to all but one other node are delayed. At time t, the reader invokes a read r1 that completes with a read quorum of four nodes. Since E1 and E2 are indistinguishable from the perspective of the reader's node, the read returns v.
• E3: Now, we construct E3 by extending E2. Right after the read r1 completes, two more nodes crash; this is allowed since f = 3. Furthermore, the messages from these crashed nodes to all the other nodes are lost. As a result, none of the remaining nodes has learned w1's value. At some later time, a reader invokes another read r2 that completes with a read quorum of four remaining nodes and returns a default value, violating linearizability. It is straightforward to extend the argument to a larger n. □

Table 1: Gus vs. related leaderless systems designed for geo-redundancy. The first number in the Latency column indicates the RTT in the common case (fast path), and the second is the RTT under contention. All of the systems tolerate f crashes with n = 2f + 1 nodes. Both EPaxos and Gryff support read-modify-write, whereas Giza and Gus do not. When n is beyond 5, some properties of Gus no longer hold.

Table 2: Read/write quorum size in Gus, and quorum size in Atlas and Tempo, where f = number of tolerated concurrent failures.