Abstract
Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This article answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance.
In this article, we take advantage of nilext interfaces to build high-performance replicated storage. We implement Skyros, a nilext-aware replication protocol that defers ordering and executing operations until their effects are externalized.
1 INTRODUCTION
Defining the right interfaces is perhaps the most important aspect of system design [54], as well-designed interfaces often lead to desirable properties. For example, idempotent interfaces make failure recovery simpler [18, 80]. Similarly, commutative interfaces enable scalable software implementations [19].
In a similar spirit, this article asks: Do some types of interfaces enable higher performance than others in storage systems? Our exercise in answering this question has led us to identify an important storage-interface property that we call nil-externality. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world (apart from the acknowledgment). As a result, a storage system can apply a nilext operation in a deferred manner after acknowledgment, improving performance.
In this article, we exploit nil-externality to design high-performance replicated storage that offers strong consistency (i.e., linearizability [43]). A standard approach today to building such a system is to use a consensus protocol like Paxos [52], Viewstamped Replication (VR) [61], or Raft [72]. For example, Facebook’s ZippyDB uses Paxos to replicate RocksDB [83]; Harp builds a replicated file system using VR [62]; other examples exist as well [11, 22, 23, 27].
A storage system built using this standard approach performs several actions before it returns a response to a request. Roughly, the system makes the request durable (if it is an update), orders the request with respect to other requests, and finally executes the request. Usually, a leader replica orchestrates these actions [61, 72]. Upon receiving requests, the leader decides the order and then replicates the requests (in order) to a set of followers; once enough followers respond, the leader applies the requests to the system state and returns responses. Unfortunately, this process is expensive: Updates incur two round trip times (RTTs) to complete.
The system can defer some or all of these actions to improve performance. Deferring durability, however, is unsafe: If an acknowledged write is lost, then the system would violate linearizability [37, 56]. Fortunately, durability can be ensured without coordination: Clients can directly store updates in a single RTT on the replicas [74, 91]. However, ordering (and subsequent execution) requires coordination among the replicas and thus is expensive. Can a system hide this cost by deferring ordering and execution?
At first glance, it may seem like all operations must be synchronously ordered and executed before returning a response. However, we observe that if the operation is nilext, then it can be ordered and executed lazily, because nilext operations do not externalize state or effects immediately.
Nilext interfaces have performance advantages, but are they practical? Perhaps surprisingly, we find that nilext interfaces are not just practical but prevalent in storage systems (Section 2). As a simple example, consider the put interface of an LSM-based key-value store such as RocksDB: it returns no execution result or error, only an acknowledgment, and hence is nilext.
Nilext-aware replication is a new approach to replication that takes advantage of nil-externality of storage interfaces (Section 3). The key idea behind this approach is to defer ordering and executing operations until their effects are externalized. Because nilext updates do not externalize state, they are made durable immediately, but expensive ordering and execution are deferred, improving performance. The effects of nilext operations, however, can be externalized by later non-nilext operations (e.g., a read to a piece of state modified by a nilext update). Thus, nilext operations must still be applied in the same (real-time) order across replicas for consistency. This required ordering is established in the background and enforced before the modified state is externalized. While nilext interfaces lead to high performance, it is, of course, impractical to make all interfaces nilext: Applications do need state-externalizing updates (e.g., increment and return the latest value, or return an error if a key is not present). Such non-nilext updates are immediately ordered and executed for correctness.
Nilext-aware replication delivers high performance in practice. First, while applications do require non-nilext updates, such updates are less frequent than nilext updates. For instance, in the production traces we analyze (Section 3.3), most clusters issue more than 90% of their updates through nilext interfaces.
Nilext-aware replication draws inspiration from the general idea of deferring work until needed similar to lazy evaluation in functional languages [44], externally synchronous file I/O [70], and previous work in databases [36, 78]. Here, we apply this general idea to hide the cost of ordering and execution in replicated storage. Prior approaches like speculative execution [49, 50, 77] reduce ordering cost by eagerly executing and then verifying that the order matches before notifying end applications. Nilext-aware replication, in contrast, realizes that some operations can be lazily ordered and executed after notifying end applications of completion.
We build Skyros, a nilext-aware replication protocol that embodies this approach (Section 4). While Skyros completes nilext updates in one RTT, non-nilext updates and reads that access unfinalized state still require synchronous ordering and execution; as Section 3.3 shows, these slow-path cases are uncommon in practice. Our experiments (Section 5) show that Skyros outperforms standard consensus-based replication. We also build two variants of Skyros; one variant, for example, combines nilext-awareness with commutativity to commit non-nilext writes faster (Section 5.7).
This article makes four contributions. First, we identify nil-externality, a property of storage interfaces, and show its prevalence. Second, we show how one can exploit this property to improve the performance of strongly consistent storage systems. Third, we present the design and implementation of Skyros, a nilext-aware replication protocol. Finally, we demonstrate the performance benefits of Skyros through rigorous experiments.
2 NIL-EXTERNALIZING INTERFACES
We first define nil-externality and describe its attributes. We next analyze which interfaces are nilext in three example storage systems; then, we discuss opportunities to improve performance by exploiting nilext interfaces in general.
2.1 Nil-externality
We define an interface to be nil-externalizing if it does not externalize storage-system state: It does not return an execution result or an execution error, although it might return an acknowledgment. A nilext interface can modify state in any way (blindly set, or read and modify). The state modified by a nilext operation can be externalized at a later point by another non-nilext operation (e.g., a read). Note that although nilext operations do not return an execution error, they may return a validation error. Validation errors (e.g., a malformed request) do not externalize state and can be detected without executing the operation. Thus, an operation that returns only validation errors (but not execution errors) is nilext.
Determining whether or not an operation is nilext is simple in most cases. Nil-externality is an interface-level property: It suffices to look at the interface (specifically, the return value and the possible execution errors) to say if an operation is nilext. Nil-externality is a static property: It is independent of the system state or the arguments of an operation; one can therefore determine if an operation is nilext without having to reason about all possible system states and arguments.
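Because nil-externality is a static, interface-level property, a client library can classify operations once, by operation type, with no knowledge of system state or arguments. A minimal sketch (the operation names below are hypothetical, not drawn from any particular system):

```python
# Static, per-interface classification: nil-externality depends only on
# the operation type, never on arguments or current system state.
NILEXT_OPS = {"put", "delete"}                   # return no execution result/error
NON_NILEXT_OPS = {"get", "insert_or_fail", "increment_and_get"}

def is_nilext(op_type: str) -> bool:
    """Decide statically whether an operation may be acknowledged
    before it is ordered and executed."""
    if op_type in NILEXT_OPS:
        return True
    if op_type in NON_NILEXT_OPS:
        return False
    raise ValueError(f"unknown operation type: {op_type}")
```

A lookup like this is all a client needs to route a request down the fast (nilext) or slow (non-nilext) path.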
2.2 Nil-externality in Storage Systems
We now analyze which interfaces are nilext in three storage systems that expose a key-value API (see Table 1). We pick these systems as candidates given their widespread use [16, 33, 64, 71]; exploiting nilext interfaces in these systems to improve performance can benefit many deployments.
RocksDB and LevelDB are LSM-based [73] key-value stores.
In Memcached, in contrast, several update interfaces are non-nilext: operations that return execution results or errors (e.g., incr, which returns the new value) cannot be deferred, whereas set, which returns only an acknowledgment, is nilext.
Nilext updates can be completed faster than non-nilext ones, because their ordering and execution can be deferred. Thus, operations such as put in an LSM-based store can be acknowledged quickly, before they are ordered and executed, while non-nilext operations cannot.
Note that while nilext operations do not return errors as part of their contract, a system that lazily applies nilext writes may encounter errors (e.g., due to insufficient disk space or a bad block) at a later point. A storage system that eagerly applies updates can detect such errors early on. Fortunately, this difference is not an obstacle to realizing the benefits of nilext interfaces in practice as we discuss later (Section 4.8).
Given the benefits of nilext interfaces, it is worthwhile to make small changes to a non-nilext interface’s semantics to make it nilext when possible. For instance, a Btree-based store may return an error upon an update to a nonexistent key; changing the semantics to not return such an error can enable a system to replicate updates quickly. Such semantic changes have been practical and useful in the past: MySQL-TokuDB supports SQL updates that do not return the number of records affected to exploit TokuDB’s fast upserts [76].
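The Btree-store change described above can be illustrated with a toy sketch (class and method names, and the return strings, are hypothetical), contrasting a non-nilext update that reports a missing key with a nilext upsert that does not:

```python
class BTreeStore:
    """Toy key-value store illustrating a semantic change that turns
    a non-nilext update into a nilext one."""

    def __init__(self):
        self.data = {}

    def update_strict(self, key, value):
        """Non-nilext: externalizes state via an execution error,
        so it must be ordered and executed before acknowledgment."""
        if key not in self.data:
            return "ERR_NO_SUCH_KEY"
        self.data[key] = value
        return "OK"

    def upsert(self, key, value):
        """Nilext variant: a blind write with no execution result or
        error, so it can be acknowledged before ordering/execution."""
        self.data[key] = value
```

The only semantic difference is whether the caller learns about a missing key; giving up that information is what makes the operation deferrable.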
3 NILEXT-AWARE REPLICATION
We now describe how a replicated storage system can exploit nil-externality to improve performance. To do so, we first give background on consensus, a standard substrate upon which strongly consistent storage is built. We then describe the nilext-aware replication approach and show that its high-performance cases are common in practice. We finally discuss how this new approach compares to existing approaches.
3.1 Consensus-based Replication Background
Consensus protocols (e.g., Paxos and VR) ensure that replicas execute operations in the same order. Clients submit operations to the leader, which then ensures that replicas agree on a consistent ordering of operations before executing them.
Figure 1 shows how requests are processed in the failure-free case. Upon an update, the leader assigns an index, adds the request to its log, and sends a prepare to the followers. The followers add the request to their logs and respond with a prepare-ok. Once the leader receives prepare-ok from enough followers, it applies the update and returns the result to the client. Reads are usually served by the leader locally; the leader is guaranteed to have seen all updates and so can serve the latest data, preserving linearizability. Stale reads on a deposed leader can be prevented using leases [61].
Fig. 1. Request processing in consensus. The figure shows how writes and reads are processed in systems built atop consensus protocols.
Latency is determined by the message delays in the protocol: Updates take two RTTs and reads one RTT. Throughput is determined by the number of messages processed by the leader [26]. Practical systems [5] batch requests to reduce the load on the leader. While batching improves throughput, it increases latency, a critical concern for applications [77, 79].
3.2 Exploiting Nil-externality for Fast Replication
Using an off-the-shelf consensus protocol to build replicated storage leads to inefficiencies, because this approach is oblivious to the properties of the storage interface. In particular, it is oblivious to nil-externality: All updates are immediately ordered and executed. Our hypothesis is that a replication protocol can deliver higher performance if it is cognizant of the underlying storage interface. Specifically, if a protocol is aware of nil-externality, then it can delay ordering and execution, improving performance. We now provide an overview of such a protocol. We describe the detailed design soon (Section 4).
A nilext-aware protocol defers ordering and execution of operations until their effects are externalized. Figure 2 shows how such a protocol handles different operations. First, nilext writes are made durable immediately, but their ordering and execution are deferred. Clients send nilext writes to all replicas. Clients wait for enough replies including one from the leader before they consider the request to be completed. Nilext writes thus complete in one RTT. At this point, the operation is durable and considered complete; clients can make progress without waiting for the operation to be ordered and executed. We say that an operation is finalized when it is assigned an index and applied to the storage system.
Fig. 2. Nilext-aware replication. The figure shows how a nilext-aware replication protocol handles different operations.
State modified by nilext updates can be externalized later by other non-nilext operations (e.g., reads). Therefore, the protocol must ensure that replicas apply the updates in the same order, and it must do so before the modifications are externalized. Thus, upon receiving a read, the leader checks if there are any unfinalized updates that this read depends upon. If there are none, then it quickly serves the read. Conversely, if there are unfinalized updates, then the leader synchronously establishes the order and waits for enough followers to accept the order; the leader then applies the pending updates and serves the read. In practice, most reads can be served without triggering synchronous ordering and execution, because the leader keeps finalizing updates in the background; thus, in most cases, updates are finalized already by the time a read arrives.
Finally, the protocol does not defer ordering and executing non-nilext updates. Clients submit non-nilext requests to the leader, which finalizes the request by synchronously ordering and executing it (and the previously completed requests).
A nilext-aware protocol can complete nilext updates in one RTT; non-nilext updates take two RTTs. A read can be served in one RTT if prior nilext updates that the read depends upon are applied before the read arrives. Thus, exploiting nil-externality offers benefit if a significant fraction of updates is nilext and reads do not immediately follow them. We next show that these conditions are prevalent in practice.
3.3 Fast Case Is the Common Case
We first analyze the prevalence of nilext updates. First, we note that in some systems, almost all updates are nilext (e.g., write-optimized key-value stores as shown in Table 1). Some systems like Memcached have many non-nilext interfaces. However, how frequently do applications use them? To answer this question, we examine production traces [85, 90] from Twemcache, a Memcached clone at Twitter [84]. The traces contain \( \sim \)200 billion requests across 54 clusters. Twemcache supports nine types of updates (similar to Memcached as shown in Table 1); except for a few of these types, the updates are nilext.
We consider 29 clusters that have at least 10% updates. Figure 3(a) shows the distribution of nilext percentages. In Twemcache, in 80% of the clusters, more than 90% of updates are nilext.
Fig. 3. Fast case is common. Panel (a) shows the distribution of nilext percentages; a bar for a range x%-y% shows the percentage of clusters where x%-y% of updates are nilext. Panel (b) shows the distribution of percentage of reads within \( T_f \) ; a bar for x%-y% shows the percentage of clusters where x%-y% of reads access objects updated within \( T_f \) . We consider \( T_f \) = 1 second, 50 ms.
We performed a similar analysis on the IBM-COS traces across 35 storage clusters with at least 10% writes (of 98 in total) [30]. COS supports three kinds of updates.
We next analyze how often reads may incur overhead. A read will incur overhead if there are unfinalized updates to the object being read. Let \( T_f \) be the time taken to finalize updates. We thus measure the time interval between a read to an object and the prior write to the same object and calculate the percentage of reads for which this interval is less than \( T_f \). We use the IBM-COS traces for this analysis, because the Twemcache traces do not have millisecond-level timestamps.
Figure 3(b) shows the distribution of the percentage of reads that access items updated within \( T_f \). We first consider \( T_f \) to be 1 second. Even with such an unrealistically high \( T_f \), in 66% of clusters, less than 5% of reads access objects modified within 1 second. We next consider a more realistic \( T_f \) of 50 ms; this value is realistic (but still conservative), because these traces are from a setting where replicas are in different zones of the same geographical region, and inter-zone latencies are about 2 ms [45]. With \( T_f=50 \) ms, in 85% of clusters, less than 5% of reads access objects modified within 50 ms; thus, only a small fraction of reads in a nilext-aware protocol may incur overhead in practice. Further, not all such reads will actually incur overhead: a prior read of the same unfinalized updates, or an intervening non-nilext update, would already have forced synchronous ordering and execution.
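The interval analysis above can be sketched as a single pass over a trace. This is a hypothetical reconstruction of the methodology, not the authors' analysis code; the trace format (timestamp, operation, key) is assumed:

```python
def fraction_reads_within(trace, t_f):
    """Given a trace of (timestamp, op, key) tuples, where op is 'read'
    or 'write', return the fraction of reads whose most recent prior
    write to the same key occurred less than t_f ago. These are the
    reads that could find unfinalized updates if finalization takes
    t_f."""
    last_write = {}          # key -> timestamp of its latest write so far
    reads = risky = 0
    for ts, op, key in trace:
        if op == "write":
            last_write[key] = ts
        else:
            reads += 1
            if key in last_write and ts - last_write[key] < t_f:
                risky += 1
    return risky / reads if reads else 0.0
```

Running this per cluster for \( T_f \) = 1 s and 50 ms yields distributions like those in Figure 3(b).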
3.4 Comparison to Other Approaches
While nilext-aware replication defers ordering, prior work has developed solutions for efficient ordering. The nilext-aware approach offers advantages over such prior solutions. While we focus on consensus-based approaches here, other ways to construct replicated storage systems exist; we discuss how exploiting nil-externality applies to them as well.
3.4.1 Efficient Ordering in Consensus.
Prior approaches to efficient ordering broadly fall into three categories.
Network Ordering. This approach enforces ordering in the network [26, 58]: the network consistently orders requests across replicas in one RTT, improving performance. In contrast, a nilext-aware protocol does not require a specialized network and thus applies to geo-replication as well.
Speculative Execution. This approach employs speculative execution to reduce ordering cost [50, 77]. Replicas speculatively execute requests before agreeing on the order. Clients then compare responses from different replicas to detect inconsistencies, and replicas roll back their state upon divergence. Replicas can thus be in an inconsistent state before the end application is acknowledged. However, when the end application is notified, the system ensures that the requests have been executed in the correct order. In contrast, the nature of nilext interfaces allows one to defer ordering and execution even after the application is notified of completion; only durability must be ensured before notifying. Ordering and execution are performed only when the effects are externalized by later operations. Also, a nilext-aware protocol does not require replicas to perform rollbacks, reducing complexity.
Exploiting Commutativity. This approach (used in Generalized Paxos [53] and EPaxos [68]) realizes that ordering is not needed when updates commute. Both commutative and nilext-aware protocols incur overhead when reads access unfinalized updates. However, as we show (Section 5.7), commutative protocols can be expensive when updates conflict and when operations do not commute. Nilext-aware replication, in contrast, always completes nilext updates in one RTT. Finally, nil-externality and commutativity are not at odds: A nilext-aware protocol can exploit commutativity to commit non-nilext writes faster (Section 5.7).
3.4.2 Other Approaches to Replicated Storage.
Shared registers [6], primary-backup [13], and chain replication [86] offer other ways to build replicated storage. Storage systems that support only reads and writes can be built using registers that are not subject to the FLP impossibility result [6]. However, shared registers cannot readily enable RMWs [2, 14], a common requirement in modern storage APIs. Starting with state machines as the base offers more flexibility, and exploiting nil-externality when possible leads to high performance. Gryff [14] combines registers (for reads and writes) and consensus (for RMWs); however, Gryff’s writes take two RTTs. Primary-backup, chain replication, and other approaches [24] support a richer API. However, primary-backup also incurs two RTTs for updates [59, 74]; similarly, updates in chain replication incur many message delays. The idea of exploiting nil-externality can be used to hide the ordering cost in these approaches as well; we leave this extension as an avenue for future work.
Summary. Unlike existing approaches, nilext-aware replication takes advantage of nil-externality of storage interfaces. It should perform well in practice: nilext updates contribute to a large fraction of writes and reads do not often access recent updates. This approach offers advantages over existing efficient ordering mechanisms: It requires no network support; it can defer execution beyond request completion and does not require rollbacks; it offers advantages over and combines well with exploiting commutativity.
4 SKYROS DESIGN AND IMPLEMENTATION
We now describe the design and implementation of Skyros, our nilext-aware replication protocol.
4.1 Overview
We use VR (or Multi-Paxos) as our baseline to highlight the differences in Skyros.
In VR, the leader establishes an order by sending a prepare and waiting for prepare-ok from f followers. The leader then does an Apply upcall into the storage system to execute the operation.
Fig. 4. Client interface and upcalls. The figure shows the client interface and the upcalls the replication layer makes into the storage system.
Clients submit nilext updates to all replicas using InvokeNilext. Since nil-externality is a static property (it does not depend upon the system state), clients can decide which requests are nilext and invoke the appropriate call. Upon receiving a nilext update, replicas invoke the MakeDurable upcall to make the operation durable (Section 4.2).
Although nilext updates are not immediately finalized, they must be executed in the same real-time order across replicas. The leader gets the replicas to agree upon an order and the replicas apply the updates in the background (Section 4.3).
Clients send read requests to the leader via InvokeRead. When a read arrives, the leader does a Read upcall. If all updates that the read depends upon are already applied, then the read is served quickly; otherwise, the leader orders and executes updates before serving the read (Section 4.4).
Clients send non-nilext updates to the leader via InvokeNonNilext; such updates are immediately finalized (Section 4.5).
4.2 Nilext Updates
Clients send nilext updates directly to all replicas including the leader to complete them in one RTT. Each request is uniquely identified by a sequence number, a combination of client-id and request number. Similarly to VR, only replicas in the normal state reply to requests and duplicate requests are filtered. A replica stores the update by invoking MakeDurable.
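The client-side nilext write path can be sketched as follows. This is a simplified, single-threaded sketch with the transport stubbed out (send_fn is a placeholder, not part of Skyros); it shows the sequence numbering and the completion condition, namely acknowledgments from a supermajority of replicas including the leader:

```python
import itertools
import math

class SkyrosClient:
    """Sketch of the client-side nilext write path. send_fn(replica_id,
    msg) stands in for the network and returns True if that replica
    acknowledged making the request durable."""

    def __init__(self, client_id, n_replicas, leader_id, send_fn):
        self.client_id = client_id
        self.n_replicas = n_replicas
        self.leader_id = leader_id
        self.send_fn = send_fn
        self._next_req = itertools.count()
        f = (n_replicas - 1) // 2                  # n = 2f + 1
        self.supermajority = f + math.ceil(f / 2) + 1

    def invoke_nilext(self, op):
        # The (client-id, request-number) pair uniquely identifies the
        # request, letting replicas filter duplicates on retries.
        seq = (self.client_id, next(self._next_req))
        acked = {r for r in range(self.n_replicas)
                 if self.send_fn(r, (seq, op))}
        # Complete only when a supermajority, including the leader,
        # has stored the update in its durability log.
        return len(acked) >= self.supermajority and self.leader_id in acked
```

With five replicas (f = 2), the supermajority is four, matching Figure 5(a).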
Fig. 5. Skyros writes and reads, and durability log states. Panel (a) shows how Skyros processes nilext writes and reads; d-log: durability, c-log: consensus log, L: leader; f = 2 and supermajority = 4. Panel (b) shows the possible durability logs for two completed nilext operations a and b. In (i) and (ii), b follows a in real time, whereas in (iii) and (iv), they are concurrent.
Note that an update need not be added in the same position in the durability logs across replicas. For example, in Figure 5(b)(i), b is considered completed although its position is different across durability logs. Then, why do Skyros clients wait for a response from the leader and for acknowledgments from a supermajority of replicas?
Why is a simple majority (\( f+1 \)) insufficient? Consider an update b that follows another update a in real time. Let us suppose for a moment that we use a simple majority. A possible state then is \( \langle D_1{:}ab, D_2{:}ab, D_3{:}ab, D_4{:}ba, D_5{:}ba \rangle \), where \( D_i \) is the durability log of replica \( S_i \) (as shown in Figure 6). This state is possible, because a client could consider a to be completed once it receives acknowledgments from \( S_1 \), \( S_2 \), and \( S_3 \). Then, b starts and is stored on all durability logs and so is considered completed; a now arrives late at \( S_4 \) and \( S_5 \). Assume the current leader (\( S_1 \)) crashes. Now, we have four replicas whose logs are \( \langle D_2{:}ab, D_3{:}ab, D_4{:}ba, D_5{:}ba \rangle \). With these logs, one cannot determine the correct order. A supermajority quorum avoids this situation: writing to a supermajority ensures that a majority within any available majority is guaranteed to have the requests in the correct order in their durability logs. We later show how, by writing to a supermajority, Skyros recovers completed operations in the correct order during view changes.
Fig. 6. Why a simple majority is insufficient. The figure shows why writing to a simple majority is insufficient. First, operation a completes on a simple majority and is considered committed; operation b then starts and completes on all replicas; a now arrives later at \( S_4 \) and \( S_5 \) . If current leader fails now, then the correct order of b follows a cannot be reconstructed from the durability logs of available replicas.
During normal operation, the leader’s durability log is guaranteed to have the updates in the correct order. This is because a response from the leader is necessary for a request to complete. Thus, if an update b follows another update a in real time, then the leader’s durability log is guaranteed to have a before b (while some replicas may contain them in a different order as in Figure 5(b)(ii)). This guarantee ensures that when clients read from the leader, they see the writes in the correct order. The leader uses this property to ensure that operations are finalized to the consensus log in the correct order. If a and b are concurrent, then they can appear in the leader’s log in any order as in Figure 5(b)(iii) and (b)(iv).
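Concretely, with \( n = 2f+1 \) replicas the durability supermajority is \( f+{\lceil {f/2}\rceil }+1 \) (e.g., 4 of 5 when f = 2, as in Figure 5). A few lines of arithmetic confirm the property the text relies on, namely that any such supermajority overlaps any majority in at least \( {\lceil {f/2}\rceil }+1 \) replicas:

```python
import math

def quorum_sizes(f):
    """For n = 2f + 1 replicas, return (write supermajority, majority)."""
    return f + math.ceil(f / 2) + 1, f + 1

def min_overlap(f):
    """Fewest supermajority members guaranteed inside ANY majority:
    |W| + |M| - n, by inclusion-exclusion over n = 2f + 1 replicas."""
    w, m = quorum_sizes(f)
    return w + m - (2 * f + 1)

# For every f, at least ceil(f/2)+1 members of any available majority
# (a majority of its f+1 members) saw each completed write -- exactly
# the property the view change relies on.
for f in range(1, 50):
    assert min_overlap(f) == math.ceil(f / 2) + 1
```

This is the same inclusion-exclusion counting the correctness argument in Section 4.7 uses.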
4.3 Background Ordering and Execution
While nilext updates are not immediately ordered, they must be ultimately executed in the same real-time order across replicas. The leader is guaranteed to have all completed updates in its durability log in real-time order. Periodically, the leader takes an update from its durability log (via the GetDurabilityLogEntries upcall), adds it to the consensus log, and initiates the usual ordering protocol. Once f followers respond after adding the request to their consensus logs, the leader applies the update and removes it from its durability log. At this point, the request is finalized. As in VR, the leader sends a commit for the finalized request; the followers apply the update and then remove it from their durability logs. Note that this step is the same as in VR; once \( f+1 \) nodes agree on the order, at least one node in any majority will have requests in the correct order in its consensus log.
The leader employs batching for this background work; it adds many requests to its consensus log and sends one prepare for the entire batch. Once f followers respond, the leader applies the batch and removes its requests from the durability log.
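The background finalization loop can be sketched as follows; replication is stubbed out (replicate_fn stands in for the prepare/prepare-ok exchange and returns the number of followers that accepted), and retry logic on an insufficient quorum is omitted:

```python
class Leader:
    """Sketch of the leader's background finalization. The leader moves
    entries from its durability log into the consensus log in order,
    replicates the batch with one prepare, then applies and trims."""

    def __init__(self, f, replicate_fn, apply_fn):
        self.f = f
        self.durability_log = []           # completed-but-unfinalized ops
        self.consensus_log = []            # ordered, replicated ops
        self.replicate_fn = replicate_fn   # returns #followers that accepted
        self.apply_fn = apply_fn           # Apply upcall into storage

    def finalize_in_background(self, max_batch=64):
        batch = self.durability_log[:max_batch]
        if not batch:
            return 0
        # The leader's durability-log order is authoritative, so the
        # batch is appended to the consensus log in that order.
        self.consensus_log.extend(batch)
        if self.replicate_fn(batch) >= self.f:     # one prepare per batch
            for op in batch:
                self.apply_fn(op)
            del self.durability_log[:len(batch)]
            return len(batch)
        return 0        # a real leader would retry the prepare
```

Calling finalize_in_background periodically keeps the durability log short, which is what makes the read fast path common.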
4.4 Reads
Clients read only at the leader in Skyros. Upon receiving a read, the leader makes the Read upcall into the storage system, which checks whether the read depends on updates that are still pending in the durability log.
If there are no pending updates, then the storage system populates the response by reading the state, sets the need_sync bit to 0, and returns the read value to the replication layer. The leader then returns the response to the client, completing the read in one RTT (e.g., read-a in Figure 5(a)(ii)).
Conversely, if there are pending updates, then the storage system sets the need_sync bit. In that case, the leader synchronously adds all requests from the durability log to the consensus log to order and execute them (e.g., read-c in Figure 5(a)(iii)). Once f followers respond, the leader applies all the updates and then serves the read. Fortunately, the periodic background finalization reduces the number of requests that must be synchronously ordered and executed during such reads.
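The leader's read path can be sketched as follows. This is a simplified model (the field and function names are hypothetical): the need_sync decision is reduced to a per-key check of pending updates, and the synchronous ordering round is stubbed out by sync_finalize_fn:

```python
class LeaderReadPath:
    """Sketch of the leader's read path. The storage system reports,
    via a need_sync decision, whether unfinalized updates could affect
    the read; if so, the leader finalizes them before answering."""

    def __init__(self, sync_finalize_fn):
        self.store = {}                  # applied (finalized) state
        self.pending = []                # unfinalized (key, value) updates
        self.sync_finalize = sync_finalize_fn   # replicates pending ops

    def _apply_pending(self):
        for key, value in self.pending:
            self.store[key] = value
        self.pending.clear()

    def read(self, key):
        need_sync = any(k == key for k, _ in self.pending)
        if need_sync:
            # Slow path: synchronously order and execute pending
            # updates (one round of replication, stubbed out here).
            self.sync_finalize(list(self.pending))
            self._apply_pending()
        return self.store.get(key)
```

The fast path touches no other replica; the slow path pays one replication round, which background finalization makes rare.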
4.5 Non-nilext Updates
If an update externalizes state, then it must be immediately ordered and executed. Clients send such non-nilext updates only to the leader. The leader first adds all prior requests in the durability log to the consensus log; it then adds the non-nilext update to the end of the consensus log and then sends a prepare for all the added requests. Once f followers respond, the leader applies the non-nilext update (after applying all prior requests) and returns the result to the client.
4.6 Replica Recovery and View Changes
So far, we have described only the failure-free operation. We now discuss how Skyros handles replica recovery and view changes.
Replica Recovery. Similarly to VR, a failed replica in Skyros recovers by fetching the state it missed (including the logs) from the other replicas before it resumes participating in the protocol.
View Changes. In VR, when the leader of the current view fails, the replicas change their status from normal to view-change and run a view-change protocol. The new leader must recover all the committed operations in the consensus log before the system can accept requests. The new leader does this by waiting for f other replicas to send a do-view-change message containing their consensus logs; it then adopts the most up-to-date log among them.
In Skyros, the new leader must recover not only the consensus log but also the durability log, because completed-but-not-yet-finalized operations are present only in durability logs.
To correctly recover the durability log, a replica also sends its durability log during the view change; the new leader then runs the RecoverDurabilityLog procedure (Figure 7) over the \( f+1 \) durability logs it has gathered.
Fig. 7. RecoverDurabilityLog. The figure shows the procedure to recover the durability log at the leader during a view change.
Fig. 8. RecoverDurabilityLog example. The figure shows how RecoverDurabilityLog works. \( S_1 \) , the leader of the previous view view-1, has failed; this is a view-change for view-2 for which \( S_2 \) is the leader.
The system must make progress with f failures; thus, the procedure must correctly recover the durability log with \( f+1 \) replicas participating in a view change. As in VR, upon receiving \( f+1 \) do-view-change messages (including its own), the new leader first recovers the consensus log; it then examines the \( f+1 \) durability logs and collects into a set E every operation that appears in at least \( {\lceil {f/2}\rceil }+1 \) of them (line 2 in Figure 7).
The above steps give the operations that form the durability log, but not the real-time order among them. To determine the order, the leader considers every pair of operations \( \langle x,y \rangle \) in E and counts the number of logs where x appears before y or x appears but y does not. If this count is at least \( {\lceil {f/2}\rceil }+1 \), then the leader determines that y follows x in real time. In Figure 8(ii), a appears before c on \( \geqslant 2 \) logs, and so the leader determines that c follows a. In contrast, a does not appear before b (or vice versa) in \( \geqslant 2 \) logs, and thus a and b are concurrent. Hence, this step gives only a partial order.
The leader constructs the total order as follows. It first adds all operations in E as vertices in a graph, G (lines 4 and 5). Then, for every pair of vertices \( \langle a,b \rangle \) in G, an edge is added from a to b if, on at least \( {\lceil {f/2}\rceil }+1 \) logs, either a appears before b or a is present but b is not (lines 6–10). G is a DAG whose edges capture the real-time order between operations. To arrive at the total order, the leader topologically sorts G (line 11) and uses the result as its durability log (NLD). In Figure 8(ii), both bac and abc are valid total orders.
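The RecoverDurabilityLog procedure lends itself to a direct implementation: majority-count the operations, build the DAG from pairwise precedence votes, and topologically sort. A sketch (cycle handling is omitted, since argument A2 guarantees G is acyclic):

```python
import math

def recover_durability_log(durability_logs, f):
    """Recover the new leader's durability log from the f+1 durability
    logs gathered in a view change (a sketch of Figure 7's procedure).
    Each log is a list of operation ids in that replica's local order."""
    threshold = math.ceil(f / 2) + 1
    # Keep only operations present on at least ceil(f/2)+1 logs; every
    # completed-but-unfinalized operation is guaranteed to qualify.
    counts = {}
    for log in durability_logs:
        for op in log:
            counts[op] = counts.get(op, 0) + 1
    E = {op for op, c in counts.items() if c >= threshold}
    # Add an edge x -> y when, on at least ceil(f/2)+1 logs, x appears
    # before y or x appears without y; edges capture real-time order.
    edges = {op: set() for op in E}
    for x in E:
        for y in E:
            if x == y:
                continue
            votes = sum(1 for log in durability_logs
                        if x in log and (y not in log
                                         or log.index(x) < log.index(y)))
            if votes >= threshold:
                edges[x].add(y)
    # Topologically sort the DAG to obtain a total order consistent
    # with the recovered partial (real-time) order.
    order, visited = [], set()
    def visit(op):
        if op not in visited:
            visited.add(op)
            for succ in edges[op]:
                visit(succ)
            order.append(op)
    for op in E:
        visit(op)
    return order[::-1]
```

With f = 2, for example, an operation must appear on at least two of the three gathered logs to survive into the recovered log.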
The leader then appends the operations from the durability log to the consensus log; duplicate operations are filtered using sequence numbers. Then, the leader sets its status as normal. The leader then sends the consensus log in the
4.7 Correctness
We now show that Skyros is correct. Specifically, we show that completed operations remain durable (C1) and that operations are executed in a linearizable order (C2).
C1. Ensuring durability when the leader is alive is straightforward; a failed replica can recover its state from the leader. Durability must also be ensured during view changes; the new leader must recover all finalized and completed operations. Finalized operations are part of at least \( f+1 \) consensus logs. Thus, at least one among the \( f+1 \) replicas participating in the view change is guaranteed to have the finalized operations and thus will be recovered (this is similar to VR).
Next we show that completed operations that have not been finalized are recovered. Let v be the view for which a view change is happening and the highest normal view be \( v^\prime \). We first establish that any operation that completed in \( v^\prime \) will be recovered in v. Operations are written to \( f+{\lceil {f/2}\rceil }+1 \) durability logs before they are considered completed and are not removed from the durability logs before they are finalized. Therefore, among the \( f+1 \) replicas participating in the view change for v, a completed operation in \( v^\prime \) will be present in at least \( {\lceil {f/2}\rceil }+1 \) durability logs. Because the new leader checks which operations are present in at least \( {\lceil {f/2}\rceil }+1 \) logs (line 2 in Figure 7), operations completed in \( v^\prime \) that are not finalized will be recovered as part of the new leader’s durability log.
We next show that operations that were completed in an earlier view \( v^{\prime \prime } \) will also survive into v. During the view change for \( v^\prime \), the leader of \( v^\prime \) would have recovered the operations completed in \( v^{\prime \prime } \) as part of its durability log (by the same argument above). Before the view change for \( v^\prime \) completed, the leader of \( v^\prime \) would have added these operations from its durability log to the consensus log. Any node in the normal status in view \( v^{\prime } \) would thus have these operations in its consensus log. Consensus-log recovery would ensure these operations remain durable in successive views including v.
C2. During normal operation, the leader’s durability log reflects the real-time order. The leader adds operations to its consensus log only in order from its durability log. Before a (non-nilext) operation is directly added to the consensus log, all prior operations in the durability log are appended to the consensus log as well. Thus, all operations in the consensus log reflect the linearizable order. Reads are served by the leader, which is guaranteed to have all acknowledged operations; thus, any read to an object will include the effect of all previous operations. This is because the leader ensures that any pending updates that the read depends upon are applied in a linearizable order before the read is served.
The correct order must also be maintained during view changes. Similarly to VR, the order established among the finalized operations (in the consensus log) survives across views; any operation committed to the consensus log will survive in the same position.
Next, we show that the linearizable order of completed-but-not-finalized operations is preserved. As before, we need to consider only operations that were completed but not yet finalized in \( v^\prime \); the remaining operations will be recovered as part of the consensus log. We now show that for any two completed operations x and y, if y follows x in real time, then x will appear before y in the new leader’s recovered durability log. Let G be a graph containing all completed operations as its vertices. Assume that for any pair of operations \( \langle x,y\rangle \), a directed edge from x to y is correctly added to G if y follows x in real time (A1). Next assume that G is acyclic (A2). If A1 and A2 hold, then a topological sort of G ensures that x appears before y in the result of the topological sort. We show that A1 and A2 are ensured by the view-change protocol.
A1: Consider two completed operations a and b such that b follows a in real time. Since a completed before b, when b starts, a must have already been present on at least \( f+{\lceil {f/2}\rceil }+1 \) durability logs; let this set of logs be DL. Now, for each log dl in DL, if b is written to dl, then b would appear after a in dl. If b is not written to dl, then a would appear in dl but not b. Thus, on at least \( f+{\lceil {f/2}\rceil }+1 \) durability logs, either a appears before b or a is present but b is not. Consequently, among the \( f+1 \) replicas participating in the view change, on at least \( {\lceil {f/2}\rceil }+1 \) logs, either a appears before b or a is present but b is not. Because the leader adds an edge from a to b when this condition is true (lines 7–9 in Figure 7) and because it considers all pairs, A1 is ensured.
A2: Since \( {\lceil {f/2}\rceil }+1 \) is a majority of \( f+1 \), an opposite edge from b to a cannot be added to G. Since all pairs are considered, G is acyclic.
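The graph construction and topological sort described above (lines 7–9 in Figure 7) can be sketched as follows, again as our own reconstruction: each durability log is an ordered list of operation IDs, and an edge a → b is added when, on a majority of the participating logs, a appears before b or a is present while b is absent.

```python
from math import ceil
from itertools import permutations
from graphlib import TopologicalSorter

def recover_order(durability_logs, f):
    """Recover a linearizable order over completed operations from the
    f+1 durability logs participating in a view change."""
    threshold = ceil(f / 2) + 1
    ops = {op for log in durability_logs for op in log}
    graph = {op: set() for op in ops}  # maps op -> set of predecessors
    for a, b in permutations(ops, 2):
        votes = 0
        for log in durability_logs:
            # "a before b" or "a present but not b" on this log
            if a in log and (b not in log or log.index(a) < log.index(b)):
                votes += 1
        if votes >= threshold:
            graph[b].add(a)  # a must precede b in the recovered order
    return list(TopologicalSorter(graph).static_order())
```

For example, with f = 2 and logs [["x", "y"], ["x", "y"], ["x"]], the edge x → y is added on all three logs, so the recovered order places x before y.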
A completed operation is assigned a position only when it is finalized. Since
Model Checking. We have modeled the request-processing and view-change protocols in Skyros.
4.8 Practical Issues and Solutions
We now describe a few practical problems we handled in Skyros.
Space and Catastrophic Errors. Because nilext updates are not immediately executed, certain errors cannot be detected. For instance, an operation can complete but may fail later when applied to the storage system due to insufficient space. A protocol that immediately executes operations, in theory, could propagate such errors to clients. However, such space errors can be avoided in practice by using space watermarks that the replication layer has visibility into; once a threshold is hit, the replication layer can throttle updates while the storage system reclaims space. One cannot, however, anticipate catastrophic memory or disk failures. Fortunately, this is not a major concern in practice. Given the inherent redundancy, a Skyros replica that encounters such an error can recover its state from the other replicas.
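The watermark idea amounts to a simple threshold check that the replication layer consults before admitting new nilext updates; the function name and the 90% default below are hypothetical, not from the paper.

```python
def should_throttle(used_bytes, capacity_bytes, watermark=0.9):
    """Sketch of the space-watermark check: once usage crosses the
    watermark, the replication layer throttles new nilext updates so
    that deferred operations never fail later for lack of space."""
    return used_bytes >= watermark * capacity_bytes

# Throttle at 95% usage with the default 90% watermark; admit at 50%.
should_throttle(95, 100)  # True
should_throttle(50, 100)  # False
```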
Determining Nil-externality. While it is straightforward in many cases to determine whether or not an interface is nilext, occasionally it is not. For instance, a database update may invoke a trigger that can externalize state. However, when unsure, clients can safely choose to say that an interface is non-nilext, forgoing some performance for safety.
Replica-group Configuration and Slow Path. In our implementation, clients know the addresses of replicas from a configuration value. During normal operation,
Mitigating Stragglers in Supermajority. Network failures and reroutes as well as server overloads can cause temporary delays in replicas. Such delays may impact Skyros, which must wait for responses from a supermajority of replicas.
Possible Optimizations. In
5 EVALUATION
To evaluate Skyros, we ask the following questions:
- How does Skyros perform compared to standard replication protocols on nilext-only workloads? (Section 5.1)
- How does Skyros perform on mixed workloads? (Section 5.2)
- How do read-latest percentages affect performance? (Section 5.3)
- Does the supermajority requirement in Skyros impact performance with many replicas? (Section 5.4)
- How does Skyros perform on YCSB workloads? (Section 5.5)
- Does replicated RocksDB benefit from Skyros? (Section 5.6)
- Does Skyros offer benefit over commutative protocols? (Section 5.7)
Setup. We run our experiments on five replicas; thus, \( {f}=2 \) and \( {supermajority}=4 \). Each replica runs on a m5zn bare-metal instance [7] in AWS (US-East). Numbers reported are the average over three runs. Our baseline is VR/multi-paxos, which implements batching to improve throughput (denoted as Paxos).
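The quorum sizes in this setup follow directly from \( n=2f+1 \) and the supermajority size \( f+{\lceil {f/2}\rceil }+1 \) used throughout the article; the small sketch below (ours) reproduces the numbers above.

```python
from math import ceil

def quorum_sizes(n):
    """Simple majority (f+1) and supermajority (f + ceil(f/2) + 1)
    for a cluster of n = 2f+1 replicas."""
    f = (n - 1) // 2
    return f + 1, f + ceil(f / 2) + 1

majority, supermajority = quorum_sizes(5)  # (3, 4): matches f=2, supermajority=4
```

Note that the supermajority grows with cluster size: with seven replicas (f = 3), clients must wait for six responses.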
5.1 Microbenchmark: Nilext-only Workload
We first compare the performance for a nilext-only workload. Figure 9 plots the average latency against the throughput when varying the number of clients. We also compare to a no-batch Paxos variant in this experiment. In all further experiments, we compare only against Paxos with batching.
Fig. 9. Nilext-only workload. The figure plots the average latency against throughput for a nilext-only workload by varying the number of clients.
We make three observations from the figure. First,
5.2 Microbenchmark: Mixed Workloads
We next consider mixed workloads. We use 10 clients.
Nilext and reads. We first consider a workload with nilext writes and reads. In
Fig. 10. Mixed workloads. The figure compares the performance of Skyros to Paxos under three different mixed workloads (nilext+reads, nilext+non-nilext, and nilext+non-nilext+reads).
In the zipfian case, some keys are more popular than others. Therefore, reads may often access keys recently modified by writes. Thus, as shown, p99 latency in
Nilext and non-nilext writes. Figure 10(ii) shows the result for a workload with a mix of nilext and non-nilext writes. With low non-nilext fractions,
Writes and reads. We next run a mixed workload with all three kinds of operations. We vary the write percentage (W) and fix the non-nilext fraction to be 10% of W. As shown in Figure 10(iii), with a small fraction of writes,
5.3 Microbenchmark: Read Latest
If many reads access recently modified items, then
Figure 11 shows the result. Intuitively, if no or few reads access recently modified items, then performance of
Fig. 11. Read-latest. The figure shows the performance of Skyros with varying read-latest percentages.
5.4 Microbenchmark: Latency with Many Replicas
In prior experiments, we use five replicas and thus clients wait for four responses. With larger clusters,
Fig. 12. Latency with many replicas. The figure compares the latency of Skyros for different cluster sizes.
Microbenchmark Summary.
5.5 YCSB Macrobenchmark
We next analyze performance under six YCSB [21] workloads: Load (write-only), A (50% w, 50% r), B (5% w, 95% r), C (read-only), D (5% w, 95% r), and F (50% rmw, 50% r). Figure 13 shows the result for 10 clients. For write-heavy workloads (load, A, and F),
Fig. 13. YCSB throughput. The figure shows the throughput of Paxos and Skyros for all YCSB workloads.
To understand the effect of reads that trigger synchronous ordering, we examine the read-latency distributions (Figure 14(a)–(d)). In all the mixed YCSB workloads (A, B, D, and F), most reads complete in one RTT, while some incur overhead. However, this fraction is very small (e.g., 4% in YCSB-A and 0.3% in YCSB-B).
Fig. 14. YCSB performance. (a)–(d) show the read-latency distribution for the different YCSB workloads; (e)–(h) show the operation-latency distribution for the same workloads.
However, the slow reads do not affect the overall p99 latency. In fact, examining the distribution of operation (both read and write) latencies shows that
Paxos, with batching across many clients, can achieve high throughput levels (similarly to Skyros).
Fig. 15. Skyros latency benefits. The figures compare the average and tail (p99) latencies at maximum throughput for mixed YCSB workloads. The number below each bar shows the throughput for the workload.
We next measure the tail latencies at maximum throughputs for the different mixed workloads. As shown in Figure 15(b),
5.6 Replicated RocksDB: Paxos vs. Skyros
We have also integrated RocksDB with Skyros.
Fig. 16. RocksDB. The figure compares the performance of Paxos and Skyros in replicating RocksDB.
5.7 Comparison to Commutative Protocols
We now compare Skyros to protocols that exploit commutativity.
5.7.1 Benefits over Commutative Protocols.
We first compare Skyros to Curp [74] using a write-only workload on the kv-store (Figure 17(a)).
Fig. 17. Comparison to commutativity. (a) shows the write throughput in kv-store. (b) and (c) show the latencies for YCSB-A. (d) compares record-append throughput.
We next run YCSB-A (50%w, 50%r). As shown in Figure 17(b), Paxos reads take one RTT. In
Exploiting nil-externality offers benefit over commutativity when operations do not commute. To show this, we built a file store that supports GFS-style record appends [39]. The record-append interface is not commutative: Records must be appended in the same order across replicas. However, it is nilext: It just returns a success. Figure 17(d) shows the result when four clients append records to a file. Because every operation conflicts, Curp-c’s performance drops; it is lower than Paxos, because some requests take three RTTs.
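The distinction can be made concrete with a toy sketch (ours, not the paper’s file store): record-append externalizes only a success flag, so it is nilext; but a later read observes the append order, so appends must be applied in the same order on every replica and thus do not commute.

```python
class FileStore:
    """Hypothetical sketch of a GFS-style record-append interface.
    record_append is nilext: it modifies state but externalizes nothing
    beyond a success acknowledgment, so a replica may apply it lazily.
    It is not commutative: each record's position depends on all prior
    appends, so replicas must agree on a single append order."""

    def __init__(self):
        self.files = {}

    def record_append(self, path, record):
        self.files.setdefault(path, []).append(record)
        return True  # only an acknowledgment is externalized

    def read(self, path):
        # A later read externalizes the order chosen for the appends.
        return self.files.get(path, [])
```

A system exploiting nil-externality can acknowledge record_append after durability alone and fix the order lazily, as long as the order is settled before any read observes the file.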
6 SKYROS VARIANTS
We now describe two variants of Skyros.
6.1 Skyros-Comm: Faster Non-nilext Operations with Commutativity
While
Fig. 18. Skyros-Comm benefits. The figure shows the kv-store throughput for a nilext + non-nilext write workload.
Fortunately, however, nil-externality is compatible with commutativity. We build Skyros-Comm, a variant of Skyros that exploits commutativity in addition to nil-externality.
The last bar in Figure 18, the no-conflict case, shows that
6.2 Skyros-Adapt: Straggler and Failure Detection
As noted in Section 4.8,
Fig. 19. The effect of stragglers in a supermajority. The figure shows the performance of Paxos, Skyros, and the straggler-detection variant Skyros-Adapt when two out of five replicas are slower than the rest. We run a nilext-only workload and note the average operation latency for the entire experiment (y-axis). We repeat the experiment multiple times, increasing the delay of the two replicas in 100us increments (x-axis). Since the added delay only affects the supermajority, Paxos latencies are stable while Skyros increases linearly with delay. Skyros-Adapt detects the stragglers, switches to 2-RTT-Mode, and achieves latencies comparable to Paxos.
We build Skyros-Adapt, a variant of Skyros that detects stragglers and failures and adapts its operation accordingly.
6.2.1 Straggler Detection.
Clients in Skyros-Adapt monitor the response latencies of individual replicas.
As transient network delays are possible [9], the straggler-detection logic determines replica latency not from the most recent response but from a simple moving average. The latencies of all replicas are sorted, and the slowest replica within the supermajority is compared against the slowest within the simple majority; currently, we report a straggler if the former is greater than three times the latter. We use a moving-average window of 20 for replica latencies and run straggler detection once every 10 seconds by default.
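The detection rule can be sketched as follows; the class and method names are ours, while the window of 20 samples and the 3x factor come from the text.

```python
from collections import deque
from statistics import mean

class StragglerDetector:
    """Client-side straggler detection sketch: keep a bounded moving
    average of each replica's response latency, sort the averages, and
    flag a straggler when the slowest replica within the supermajority
    exceeds 3x the slowest within the simple majority."""

    def __init__(self, n_replicas, majority, supermajority,
                 window=20, factor=3.0):
        self.lat = [deque(maxlen=window) for _ in range(n_replicas)]
        self.majority = majority
        self.supermajority = supermajority
        self.factor = factor

    def record(self, replica, latency_us):
        self.lat[replica].append(latency_us)

    def straggler_in_supermajority(self):
        avgs = sorted(mean(samples) for samples in self.lat if samples)
        slowest_super = avgs[self.supermajority - 1]
        slowest_simple = avgs[self.majority - 1]
        return slowest_super > self.factor * slowest_simple
```

With five replicas (majority 3, supermajority 4), two replicas averaging 1000 us against three averaging 100 us trip the check, matching the scenario in Figure 19.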
When a supermajority straggler is detected, Skyros-Adapt switches to the 2-RTT-Mode.
To demonstrate the adaptiveness of Skyros-Adapt, we run a nilext-only workload in which two of the five replicas are slower than the rest, repeating the experiment while increasing the delay of the two replicas in 100-us increments (Figure 19).
Figure 20 shows the change in throughput for Paxos and Skyros-Adapt when replicas face intermittent network delays.
Fig. 20. Effect of intermittent network delays. The figure shows the performance when replicas face temporary network delays. We run a nilext-only workload and plot the throughput (y-axis) over time (x-axis) every five seconds until the experiment ends ( \( \sim \) 9 minutes). In a five-replica cluster r1 to r5 with r1 being the leader, we introduce a delay of 300us on a non-leader replica every minute starting with r5. After five minutes, when replicas r2 to r5 all have a delay of 300us, we remove the delays every minute starting with r2. When there is a straggler in the supermajority but not simple majority, Skyros-Adapt switches to the 2-RTT-Mode and thus matches the performance of Paxos. When stragglers exist even in the simple majority, Skyros-Adapt reverts to the NormalMode.
6.2.2 Failure Detection.
We run the same workload and cluster setup described in Section 6.2.1 for four minutes, reporting throughput every 5 seconds. However, instead of adding delays, we fail a non-leader replica every minute.
Fig. 21. Availability under failures. The figure shows the performance when replicas in the cluster fail. We run a nilext-only workload and plot the throughput (y-axis) over time (x-axis). In a five-replica cluster r1 to r5 with r1 being the leader, we fail a non-leader replica every minute. When two replicas fail, Skyros becomes unavailable (throughput goes to zero) and Skyros-Adapt switches into 2-RTT-Mode to stay available. When three replicas fail, both Paxos and Skyros-Adapt become unavailable.
6.2.3 Limitations.
We observe two limitations in Skyros-Adapt.
(1) While
(2) As clients in
7 DISCUSSION
We now discuss a few possible improvements and avenues for future work, both in Skyros (Section 7.1) and in exploiting nil-externality in other contexts (Section 7.2).
7.1 Skyros Improvements
In this article, we built Skyros.
Leaderless Protocols.
Scalable Reads.
Multi Datacenter Settings. Unlike protocols designed for the data center [58, 77], Skyros makes no special assumptions about the network.
7.2 Exploiting Nil-externality in Other Contexts
In this article, we described how nil-externality can be exploited in storage systems (with a particular focus on key-value stores). However, nil-externality is a general interface-level property that can be exploited in other contexts as well. We now discuss such potential avenues.
Storage Systems beyond Key-value Stores. Key-value stores (especially ones built atop write-optimized structures) have many nilext interfaces, enabling fast replication. Nil-externality can be exploited to perform fast replication for other systems such as databases and file systems as well. As an example, consider the
Message Queues. In message queues such as Kafka [4], producers insert messages into a queue; the messages are later consumed by a set of consumers. The APIs used by producers to insert messages often do not return the index at which a particular message was inserted; rather, only a simple durability acknowledgment is returned to the producer. However, the inserted messages must be ordered according to real time. Current systems establish such ordering eagerly. Instead, a system that exploits the nil-externality of producer interfaces can synchronously guarantee durability but establish the required ordering in the background, improving performance.
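A toy sketch of this deferral (ours, not Kafka’s API): the producer-facing append is acknowledged after durability alone, and indices are assigned lazily before any consumer observes them.

```python
import queue

class NilextProducerLog:
    """Sketch of a queue exploiting the nil-externality of the producer
    interface: append() only makes the message durable and acks without
    returning an index; the real-time order is materialized lazily,
    before a consumer read externalizes it."""

    def __init__(self):
        self.durable = queue.Queue()   # synchronous durability only
        self.ordered = []              # ordering applied in the background

    def append(self, msg):
        self.durable.put(msg)          # ack after durability; no index returned
        return "ok"

    def _drain(self):
        # Background ordering: assign indices in arrival (real-time) order.
        while not self.durable.empty():
            self.ordered.append(self.durable.get())

    def consume(self, index):
        # A consumer read externalizes state, so ordering is forced first.
        self._drain()
        return self.ordered[index]
```

In a real system the drain would run on a background thread with appropriate locking; it is inlined here to keep the sketch minimal.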
Interactive Applications. Interactive applications (like virtual reality and gaming) are typically built using the client-server architecture [10]. These systems often forgo fault tolerance for responsiveness. However, fault tolerance is desirable in many use cases; for example, a server failure leading to a complete loss of game state can be devastating. Applying the idea of deferred ordering and execution can alleviate this problem. In interactive systems, clients generate inputs (through gestures); the server then executes them, resulting in changes to the system state (e.g., moving of objects). The server then pushes the changes to clients, which render the view. At a high level, interactive inputs do not externalize state immediately; inputs are applied in some order by the server and later externalized. Interactive update interfaces, in essence, are nilext. Thus, ideas of nilext-aware replication can be used to make interactive servers fault tolerant without increasing latency. However, other challenges must be addressed to realize the full benefits of replication. For example, how to enable concurrent updates while preserving consistency? Can any replica push state changes to clients for better scalability?
Microservice Designs. Modern applications are built using recent paradigms such as microservices [15]. In the microservice architecture, a client request is processed in a chain of individual service components. This provides many benefits, such as modularity, scalability, and fault isolation. However, applications incur higher latencies, because each component is internally replicated; thus, processing a request may require expensive coordination among the replicas. An extension to the deferred execution idea can help realize the benefits of microservices without compromising on latency. At a high level, if a request execution need not be finalized at a component, then the request can be simply logged and later executed when external entities observe the component state in some way (e.g., when the state is explicitly read). Such deferral can lead to lower latencies and thus improve responsiveness. However, such a solution must address several challenges. For example, how do we efficiently track dependencies across components? How do we identify when coordination must be enforced?
RDMA-based Designs. Remote direct memory access (RDMA) offers a fast way to communicate within data centers [47]. RDMA does not interrupt the CPU at the receiver, enabling clients to access the server’s memory directly. This leads to low latency and better scalability. However, such one-sided operations are unsuitable (or inefficient) for implementing general-purpose storage systems [87]. For example, reading items indexed by a B-tree requires multiple one-sided operations. Prior work has designed solutions to perform reads using a single one-sided operation (e.g., by caching the internal nodes at the client [66]). However, updates still involve the server CPU, incurring overhead for write-intensive applications. We realize that exploiting nil-externality enables one to process updates purely using one-sided operations. Because nilext interfaces do not return a response, they can be completed using a single one-sided RDMA write. Such a solution can be valuable for microsecond-scale applications [3].
8 RELATED WORK
Commit before Externalize. Our idea of deferring work until externalization bears similarity to prior systems. Xsyncfs defers disk I/O until output is externalized [70], essentially moving the output commit [31, 82] to clients. SpecPaxos [77], Zyzzyva [50], and SpecBFT [88] do the same for replication. As discussed in Section 3.4.1, these protocols execute requests in the correct order before notifying the end application. Our approach, in contrast, defers ordering or executing nilext operations beyond notifying the end application.
State modified by nilext updates can be externalized by later non-nilext operations, upon which Skyros synchronously enforces ordering and execution.
Exploiting Semantics. Inconsistent replication (IR) [91] realizes that inconsistent operations only require durability and thus can be completed in one RTT. Nilext operations, in contrast, require durability and ordering. Further, IR cannot support general state machines. Prior replication [53, 68, 74] and transaction protocols [69] use commutativity to improve performance. Nil-externality has advantages over and combines well with commutativity (Section 5.7).
SMR Optimizations. Apart from the approaches in Section 3.4.1, prior systems have pushed consensus into the network [25, 26]. Domino uses a predictive approach to reduce latency in WAN [89] and allows clients to choose between Multi-Paxos and Fast-Paxos schemes. As discussed in Section 7, ideas from Domino can be utilized in Skyros.
Local Storage Techniques. Techniques in Skyros bear similarity to techniques in local storage systems.
9 CONCLUSION
In this article, we identify nil-externality, a storage-interface property, and show that this property is prevalent in storage systems. We design nilext-aware replication, a new approach to replication that takes advantage of nilext interfaces to improve performance by lazily ordering and executing updates. We experimentally demonstrate that nilext-aware replication improves performance over existing approaches for a range of workloads. More broadly, our work shows that exposing and exploiting properties across layers of a storage system can bring significant performance benefit. Storage systems, today, layer existing replication protocols upon local storage systems (such as key-value stores). Such black-box layering masks vital information across these layers, resulting in missed performance opportunities. This article shows that by making the replication layer aware of the underlying storage-interface properties, higher performance can be realized.
The source code of Skyros is publicly available.
REFERENCES
- [1] 2021. Memcached Commands. Retrieved from https://github.com/memcached/memcached/wiki/Commands#set.Google Scholar
- [2] . 2013. Leveraging sharding in the design of scalable replication protocols. In Proceedings of the ACM Symposium on Cloud Computing (SOCC’13).Google Scholar
Digital Library
- [3] . 2020. Microsecond consensus for microsecond applications. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). Banff, Canada.Google Scholar
- [4] . Kakfa. Retrieved from http://kafka.apache.org/.Google Scholar
- [5] . 2021. ZooKeeper. Retrieved from https://zookeeper.apache.org/.Google Scholar
- [6] . 1995. Sharing memory robustly in message-passing systems. J. ACM 42, 1 (1995), 124–142.Google Scholar
Digital Library
- [7] . 2020. New EC2 M5zn Instances—Fastest Intel Xeon Scalable CPU in the Cloud. https://aws.amazon.com/blogs/aws/new-ec2-m5zn-instances-fastest-intel-xeon-scalable-cpu-in-the-cloud/.Google Scholar
- [8] . 2015. An introduction to Be-trees and write-optimization. USENIX ;login: 40, 5 (2015), 22–28.Google Scholar
- [9] . 2010. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC’10). Association for Computing Machinery, 267–280. Google Scholar
Digital Library
- [10] . 2001. Latency compensating methods in client/server in-game protocol design and optimization. In Game Developers Conference, Vol. 98033.Google Scholar
- [11] . 2011. Paxos replicated state machines as the basis of a high-performance data store. In Proceedings of the 8th Symposium on Networked Systems Design and Implementation (NSDI’11).Google Scholar
Digital Library
- [12] . 2003. Lower bounds for external memory dictionaries. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’03), Vol. 3.Google Scholar
- [13] . 1993. The primary-backup approach. Distrib. Syst. 2 (1993).Google Scholar
- [14] . 2020. Gryff: Unifying consensus and shared registers. In Proceedings of the 17th Symposium on Networked Systems Design and Implementation (NSDI’20).Google Scholar
- [15] . 2019. Aegean: replication beyond the client-server model. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP’19).Google Scholar
Digital Library
- [16] . 2020. characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST’20). Santa Clara, CA.Google Scholar
- [17] . 2019. Linearizable quorum reads in Paxos. In Proceedings of the 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’19).Google Scholar
Digital Library
- [18] . 1987. UIO: A uniform I/O system interface for distributed systems. ACM Trans. Comput. Syst. 5, 1 (1987).Google Scholar
Digital Library
- [19] . 2013. The scalable commutativity rule: Designing scalable software for multicore processors. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). Farmington, Pennsylvania.Google Scholar
- [20] . 2020. SplinterDB: Closing the bandwidth gap for NVMe key-value stores. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’20).Google Scholar
- [21] . 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the ACM Symposium on Cloud Computing (SOCC’10).Google Scholar
Digital Library
- [22] . 2012. Spanner: Google’s globally distributed database. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI’12). Hollywood, CA.Google Scholar
- [23] . 2012. Granola: Low-overhead distributed transaction coordination. In 2012 USENIX Annual Technical Conference (USENIX ATC’12).Google Scholar
- [24] . 2006. HQ Replication: A hybrid quorum protocol for byzantine fault tolerance. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI’06). Seattle, WA.Google Scholar
- [25] . 2020. P4xos: Consensus as a network service. IEEE/ACM Trans. Netw. 28, 4 (2020).Google Scholar
Digital Library
- [26] . 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR’15).Google Scholar
Digital Library
- [27] . Cluster-Level Storage @ Google. Retrieved from http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf.Google Scholar
- [28] . 1984. Implementation techniques for main memory database systems. In Proceedings of the 1984 ACM SIGMOD Conference on the Management of Data (SIGMOD’84).Google Scholar
Digital Library
- [29] . 2019. The design and operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC’19). 1–14.Google Scholar
- [30] . 2021. Object Storage Traces: A Treasure Trove of Information for Optimizing Cloud Workloads. Retrieved from https://www.ibm.com/cloud/blog/object-storage-traces.Google Scholar
- [31] . 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3 (2002), 375–408.Google Scholar
Digital Library
- [32] . 2012. The TokuFS streaming file system. In Proceedings of the 4th Workshop on Hot Topics in Storage and File Systems (HotStorage’12).Google Scholar
- [33] . 2016. MyRocks: A Space- and Write-optimized MySQL Database. Retrieved from https://engineering.fb.com/2016/08/31/core-data/myrocks-a-space-and-write-optimized-mysql-database/.Google Scholar
- [34] . 2021. Merge Operator. Retrieved from https://github.com/facebook/rocksdb/wiki/Merge-Operator.Google Scholar
- [35] . 2021. RocksDB. Retrieved from http://rocksdb.org/.Google Scholar
- [36] . 2014. Lazy evaluation of transactions in database systems. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD’14).Google Scholar
Digital Library
- [37] . 2020. Strong and efficient consistency with consistency-aware durability. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST’20). Santa Clara, CA.Google Scholar
- [38] . 2021. Exploiting nil-externality for fast replicated storage. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP’21). Association for Computing Machinery, New York, NY, 440–456. Google Scholar
Digital Library
- [39] . 2003. The google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP’03).Google Scholar
Digital Library
- [40] . 2011. LevelDB. Retrieved from https://github.com/google/leveldb.Google Scholar
- [41] . 2014. Rex: Replication at the speed of multi-core. In Proceedings of the European Conference on Computer Systems (EuroSys’14).Google Scholar
Digital Library
- [42] . 1987. Reimplementing the cedar file system using logging and group commit. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP’87).Google Scholar
Digital Library
- [43] . 1990. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3 (
July 1990).Google ScholarDigital Library
- [44] . 1989. Conception, evolution, and application of functional programming languages. ACM Comput. Surv. 21, 3 (1989).Google Scholar
Digital Library
- [45] . 2021. Locations for Resource Deployment: Multizone Regions. Retrieved from https://cloud.ibm.com/docs/overview?topic=overview-locations#mzr-table.Google Scholar
- [46] . 2015. BetrFS: A right-optimized write-optimized file system. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST’15).Google Scholar
- [47] . 2014. Using RDMA efficiently for key-value services. In Proceedings of the ACM SIGCOMM 2014 Conference.Google Scholar
Digital Library
- [48] . 2012. All about Eve: Execute-verify replication for multi-core servers. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI’12). Hollywood, CA.Google Scholar
- [49] . 1999. Processing transactions over optimistic atomic broadcast protocols. In Proceedings of the International Symposium on Distributed Computing (DISC 99).Google Scholar
Cross Ref
- [50] . 2007. Zyzzyva: Speculative byzantine fault tolerance. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 45–58.Google Scholar
- [51] . 2013. Consistency-based service level agreements for cloud storage. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). Farmington, Pennsylvania.Google Scholar
- [52] . 2001. Paxos made simple. ACM Sigact News 32, 4 (2001), 18–25.Google Scholar
- [53] Leslie Lamport. 2005. Generalized consensus and Paxos. Technical Report MSR-TR-2005-33.Google Scholar
- [54] . 1983. Hints for computer system design. In Proceedings of the 9th ACM Symposium on Operating System Principles (SOSP’83).Google Scholar
Digital Library
- [55] . 2019. Dynastar: Optimized dynamic partitioning for scalable state machine replication. In Proceedings of the IEEE 39th International Conference on Distributed Computing Systems (ICDCS’19).Google Scholar
- [56] . 2015. Implementing linearizability at large scale and low latency. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP’15). Monterey, California.Google Scholar
- [57] . 2012. Making geo-replicated systems fast as possible, consistent when necessary. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI’12). Hollywood, CA.Google Scholar
- [58] . 2016. Just say no to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). Savannah, GA.Google Scholar
- [59] . 2008. PacificA: Replication in Log-based Distributed Storage Systems.
Technical Report MSR-TR-2008-25.Google Scholar - [60] . 2011. tc-netem(8)—Linux Manual Page. Retrieved from https://man7.org/linux/man-pages/man8/tc-netem.8.html.Google Scholar
- [61] Barbara Liskov and James Cowling. 2012. Viewstamped Replication Revisited. Technical Report MIT-CSAIL-TR-2012-021.Google Scholar
- [62] . 1991. Replication in the Harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP’91).Google Scholar
Digital Library
- [63] . 2008. Mencius: Building efficient replicated state machines for WANs. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI’08).Google Scholar
- [64] . 2020. MyRocks: LSM-tree database storage engine serving Facebook’s social graph. Proc. VLDB Endow. 13, 12 (2020).Google Scholar
Digital Library
- [65] . 2017. I can’t believe it’s not causal! Scalable causal consistency with no slowdown cascades. In Proceedings of the 14th Symposium on Networked Systems Design and Implementation (NSDI’17).Google Scholar
- [66] Christopher Mitchell, Kate Montgomery, Lamont Nelson, Siddhartha Sen, and Jinyang Li. 2016. Balancing CPU and network in the cell distributed b-tree store. In Proceedings of the 2016 Usenix Annual Technical Conference (USENIX ATC’16).Google Scholar
- [67] 1992. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Datab. Syst. 17, 1 (1992), 94–162.
- [68] 2013. There is more consensus in egalitarian parliaments. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). Farmington, PA.
- [69] 2016. Consolidating concurrency control and consensus for commits under conflicts. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). Savannah, GA.
- [70] 2006. Rethink the sync. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI’06). Seattle, WA.
- [71] 2013. Scaling memcache at Facebook. In Proceedings of the 10th Symposium on Networked Systems Design and Implementation (NSDI’13).
- [72] 2014. In search of an understandable consensus algorithm. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’14).
- [73] 1996. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (1996).
- [74] 2019. Exploiting commutativity for practical fast replication. In Proceedings of the 16th Symposium on Networked Systems Design and Implementation (NSDI’19).
- [75] 2002. Handling message semantics with generic broadcast protocols. Distrib. Comput. (2002).
- [76] 2013. Fast Updates with TokuDB. Retrieved from https://www.percona.com/blog/2013/02/12/fast-updates-with-tokudb/.
- [77] 2015. Designing distributed systems using approximate synchrony in data center networks. In Proceedings of the 12th Symposium on Networked Systems Design and Implementation (NSDI’15).
- [78] 2013. Quantum databases. In Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR’13).
- [79] 2011. It’s time for low latency. In Proceedings of the 13th Workshop on Hot Topics in Operating Systems (HotOS XIII).
- [80] 1986. The Sun network file system: Design, implementation and experience. In Proceedings of the USENIX Summer Technical Conference (USENIX Summer’86).
- [81] 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (December 1990), 299–319.
- [82] 1985. Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3, 3 (1985), 204–226.
- [83] 2019. Who’s afraid of uncorrectable bit errors? Online recovery of flash errors with distributed redundancy. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’19).
- [84] 2012. Caching with Twemcache. Retrieved from https://blog.twitter.com/engineering/en_us/a/2012/caching-with-twemcache.html.
- [85] 2020. Twitter Cache Trace. Retrieved from https://github.com/twitter/cache-trace.
- [86] 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI’04).
- [87] 2020. Fast RDMA-based ordered key-value store using remote learned cache. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). Banff, Canada.
- [88] 2009. Tolerating latency in replicated state machines through client speculation. In Proceedings of the 6th Symposium on Networked Systems Design and Implementation (NSDI’09).
- [89] 2020. Domino: Using network measurements to reduce state machine replication latency in WANs. In Proceedings of the 16th International Conference on Emerging Networking Experiments and Technologies.
- [90] 2020. A large-scale analysis of hundreds of in-memory cache clusters at Twitter. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). Banff, Canada.
- [91] 2015. Building consistent transactions with inconsistent replication. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP’15). Monterey, CA.