Grove: a Separation-Logic Library for Verifying Distributed Systems

Grove is a concurrent separation logic library for verifying distributed systems. Grove is the first to handle time-based leases, including their interaction with reconfiguration, crash recovery, thread-level concurrency, and unreliable networks. This paper uses Grove to verify several distributed system components written in Go, including vKV, a realistic distributed multi-threaded key-value store. vKV supports reconfiguration, primary/backup replication, and crash recovery, and uses leases to execute read-only requests on any replica. vKV achieves high performance (67-73% of Redis on a single core), scales with more cores and more backup replicas (achieving about 2× the throughput when going from 1 to 3 servers), and can safely execute reads while reconfiguring.


Introduction
Large-scale applications run on many servers, and face a wide range of challenges typical of distributed systems such as concurrency, crashes, network outages, loosely-coupled clocks between servers, etc. This means that the developer has to consider a large number of subtle corner cases and interactions, which in turn makes it difficult to ensure that the application correctly handles all such cases. Formal verification is an attractive approach to rigorously establish correctness of such systems, and in principle could help developers ensure that they correctly handle all of the corner cases.
One particularly challenging and cross-cutting aspect of distributed systems, which has not been addressed in prior work on verification, is the use of leases. Leases [19] are a widely used technique in distributed systems. A lease is a promise that some aspect of the system will not change for some duration of time (e.g., the primary server will not be replaced for the next 5 seconds). For instance, leases are used to ensure there is at most one Paxos leader trying to run the replication protocol in Spanner [13]; and GFS [16], Chubby [4], and DynamoDB [15] have similar mechanisms. Leases allow a leader to execute read-only operations quickly, without having to contact replicas to confirm that it is still the leader. Leases are challenging to use correctly because they interact with crash recovery and reconfiguration (e.g., reconfiguration must wait until leases expire before it can choose a new primary server) and node-local concurrency (e.g., executing a read-only operation may require first checking that a lease is valid, but in the time between the check and the operation itself, the lease may have expired, and other threads may have executed additional writes).
This paper presents Grove, a library based on concurrent separation logic (CSL) [39] for reasoning about distributed systems, where state is split between nodes, crashes discard a node's memory state, network messages can be lost or duplicated, and nodes have loosely synchronized clocks. In CSL, a proof decomposes a system's state into parts called resources that are logically owned by different threads. Synchronization primitives, like mutexes, are used to transfer ownership between threads. Grove generalizes this notion of resources and ownership to reason about distributed systems; in particular, Grove introduces time-bounded invariants to reason about leases, extends Crash Hoare Logic [7,11] to reason about crashes in distributed systems, provides abstractions for reasoning about append-only logs and monotonic epoch counters, and provides a verified RPC library. This makes Grove the first to support verification of distributed systems that use leases, including their interaction with crash recovery, reconfiguration, concurrency, and unreliable networks.
To demonstrate Grove's approach, we developed a number of distributed system components written in Go (libraries, systems, and applications), and specified and verified them using Grove. As we explain in §3, these components make extensive use of Grove's ownership and resources. To name some examples: the proof of consistency for a replicated log in a primary-backup replication library uses ownership of logical append-only lists; the proof of crash recovery in a durable storage library uses ownership of durable files; proofs about RPCs that may re-execute many times use duplicable ownership (which can be thought of as knowledge), since they cannot transfer ownership of unique resources; and proofs about read operations that use leases to avoid coordination in a state-machine replication library use time-bounded invariants to prove the state being read is not stale.
By using CSL, Grove enables modular reasoning: developers can verify each component of a distributed system separately, and reason about code line-by-line, rather than explicitly considering all possible interleavings. Nevertheless, these proofs still compose into a complete proof of the entire distributed system. For instance, our case study builds a replicated key-value service (called vKV) out of multiple independent components (RPC library, primary-backup replication, state-machine replication, durable storage, configuration service, etc.), and builds an example bank application on top of vKV and a distributed lock service. The proof of the bank considers only the specifications of the underlying vKV and lock service, and does not look at their implementation. At the same time, the composed proofs ensure there are no subtle bugs due to surprising interactions between the components. As we show in §6, the proof-to-code ratio is about 12×, on par with other distributed systems verification efforts, which shows that handling leases, reconfiguration, concurrency, etc., with Grove does not come at the cost of inflated proof effort.
Grove's support for leases, concurrency, reconfiguration, and crash recovery is crucial for verifying high-performance distributed systems. §7 shows that vKV achieves good performance (67-73% of the throughput of Redis in a single-core unreplicated configuration), scales well with the number of cores and the number of backup replicas (going from 463,491 to 816,252 req/sec for a 95%-read YCSB workload when using 1 and 3 servers respectively), and is able to serve read requests quickly and safely during reconfiguration.
To summarize, the contribution of this paper is Grove, which generalizes concurrent separation logic (CSL) to support distributed systems with RPCs, leases, replication, reconfiguration, and crash recovery. The paper provides lessons, insights, and techniques at several different levels. For a general systems audience, Grove demonstrates that ownership-based reasoning (using CSL) is valuable for distributed systems, by showing what kinds of distributed systems can be verified, how verification catches specific subtle bugs (the "what if" scenarios in §4), and how CSL leads to modular development (§5). For a verification audience, Grove presents techniques and ideas for how to extend CSL to reason about distributed systems issues, such as RPCs, leases, and replication, as we describe in §3. These ideas may be helpful to researchers building frameworks for verifying distributed systems. Finally, the source code of Grove and its case studies is publicly available,¹ for experts that may want to adopt Grove's lower-level techniques for encoding distributed systems in the Iris separation logic [25,26,31].
One limitation is that Grove is only able to verify safety properties, ensuring that a system never returns the wrong results. Grove cannot verify liveness properties, such as ensuring that the system will respond or otherwise make progress.
¹ Grove is available at https://github.com/mit-pdos/perennial and the case studies are at https://github.com/mit-pdos/gokv.

Motivating case studies
Grove's goal is to enable verification of distributed system components in a way that allows composing them into a single proof for the entire system. To illustrate the verification challenges that Grove aims to address, this section presents a number of components typically seen in distributed systems, shown in Figure 1, spanning from RPC and storage at the lowest level, to libraries for replicated state machines, key-value stores, and locking, to application-level code such as a bank example. Distributed systems challenges, such as concurrency, crashes, clocks, etc., show up in many of these components, and a key benefit of Grove is that it provides a consistent framework for handling these issues in the specifications and proofs of each component, which in turn allows combining these components into larger verified systems. The components fit together to build vKV, a replicated key-value store, as well as applications on top of it, as shown in Figure 2.
These components use sophisticated techniques to achieve high performance and strong correctness guarantees. For instance, they use threads on each machine to execute RPCs in parallel; store data durably on disk (using a separate thread for performance) and recover their state after a crash; batch disk writes and pipeline requests for improved performance; achieve linearizability even in the presence of retransmission, crashes, and in-flight client requests while adding or removing servers through reconfiguration; and use leases to coordinate the execution of read-only requests at each replica with reconfiguration.

RPC library
An important building block for distributed systems is RPC, which allows a client to invoke a procedure on a remote server. For instance, a client invocation of rpcClient.Call("f", args) invokes f(args) on the server to which rpcClient is connected. The rpc library provides unreliable RPCs, meaning that one invocation by a client can result in the server running the corresponding function one, zero, or many times. This is because the underlying network may drop, reorder, or duplicate packets. Applications typically do not directly invoke RPCs; rather, applications use various clerks, which wrap RPCs with additional handling (such as adding request IDs, retrying, etc.).
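To make the "zero, one, or many times" behavior concrete, the following is a minimal sketch of a clerk-style retry loop over a simulated unreliable endpoint. It is not the rpc library's code: flakyServer, ErrTimeout, and callWithRetry are hypothetical names introduced only for illustration, and the "network" is simulated by dropping replies in-process.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTimeout is a hypothetical error standing in for a lost request or reply.
var ErrTimeout = errors.New("rpc: timeout")

// flakyServer simulates an unreliable RPC endpoint (an assumption for this
// sketch): it loses the reply to the first `drops` calls, even though the
// handler has already run.
type flakyServer struct {
	drops    int
	executed int // number of times the handler actually ran
}

// Call runs the handler and then possibly loses the reply.
func (s *flakyServer) Call(args int) (int, error) {
	s.executed++
	if s.drops > 0 {
		s.drops--
		return 0, ErrTimeout
	}
	return args + 1, nil
}

// callWithRetry is a clerk-style wrapper: it retries until a reply arrives.
// One logical call may therefore execute the handler several times.
func callWithRetry(s *flakyServer, args int) int {
	for {
		reply, err := s.Call(args)
		if err == nil {
			return reply
		}
	}
}

func main() {
	s := &flakyServer{drops: 2}
	reply := callWithRetry(s, 41)
	fmt.Println("reply:", reply, "handler executions:", s.executed)
}
```

The point of the sketch is that retrying turns "possibly zero executions" into "at least one, possibly many", which is why handlers either need to be idempotent or need duplicate detection at a higher layer.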

Replicated state machine library
The focal point of our case study is a replicated state machine library called vRSM, as shown in Figure 3. vRSM replicates a state machine supplied by the application (the exact interface is shown in Figure 10). §2.3 discusses how applications use this interface. vRSM is implemented in several components, which this subsection describes. The components each handle a different aspect of state machine replication, allowing, for instance, durability to be implemented separately from the replication protocol.

replica server: replicating writes
The replica component manages copies of the state machine being replicated. A replica server is either a primary and handles write requests from clients, or else is a backup (we discuss the handling of reads later in §2.2.3). Upon receiving an operation, a primary server applies it locally and then replicates it to all backup servers before replying to the client, as shown in Figure 4. To replicate an operation, the primary spawns threads to send RPCs concurrently to each backup and then waits for all the threads to finish (using a Go WaitGroup) to know that the operation is committed (i.e., applied by all replica servers). Backup replicas handle these RPCs by also applying the operation locally. s.stateLogger takes care of managing the RSM state, as we describe in §2.2.4. This protocol requires the primary to replicate the operation to all servers before replying to a client, so if even a single backup is unavailable, the replication protocol is blocked. To unblock the system, an operator or an automatic failure detector can remove unresponsive servers (and add new ones) by invoking the reconfig component, described next.
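The replication pattern described above can be sketched as follows. This is a simplified illustration rather than the code in Figure 4: the primary and backup types and their method names are hypothetical, backups are called in-process instead of over RPC, and the sketch omits operation indices, epochs, error handling, and durability.

```go
package main

import (
	"fmt"
	"sync"
)

// backup is a hypothetical stand-in for a backup replica reached over RPC.
type backup struct {
	mu  sync.Mutex
	log []string
}

// applyAsBackup models the RPC handler that applies the operation locally.
func (b *backup) applyAsBackup(op string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.log = append(b.log, op)
}

// primary holds its own copy of the log plus handles to all backups.
type primary struct {
	mu      sync.Mutex
	log     []string
	backups []*backup
}

// apply applies op locally, replicates it to every backup concurrently, and
// returns only after all backups have acknowledged it.
func (p *primary) apply(op string) {
	p.mu.Lock()
	p.log = append(p.log, op)
	p.mu.Unlock()

	var wg sync.WaitGroup
	for _, b := range p.backups {
		wg.Add(1)
		go func(b *backup) {
			defer wg.Done()
			b.applyAsBackup(op)
		}(b)
	}
	wg.Wait() // at this point op is committed: applied by all replicas
}

func main() {
	p := &primary{backups: []*backup{&backup{}, &backup{}}}
	p.apply("put x=1")
	fmt.Println(len(p.log), len(p.backups[0].log), len(p.backups[1].log))
}
```

Because wg.Wait() blocks until every goroutine finishes, a single unresponsive backup stalls the call, which mirrors the blocking behavior that motivates reconfiguration.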

reconfig using configservice
The reconfig component allows adding or removing replica servers by making use of sequentially numbered epochs and a configservice component. An epoch typically corresponds to a configuration, that is, a set of servers with one designated as the primary. We call such epochs live, even if such an epoch has been superseded by another one. However, some epochs may not have a corresponding configuration, if that epoch never started running (e.g., because a node running reconfig crashed); we call such epochs reserved. The configservice keeps track of the latest epoch number and the most recent configuration (a list of server addresses), which may be from an earlier epoch if the current epoch is not live.
Clients can invoke operations concurrently with reconfiguration, which runs the risk of a client's operations being applied in an old configuration after the new configuration has already started, so that the new configuration misses these operations. To prevent this, reconfiguration first seals one of the servers from the old epoch. A sealed server no longer modifies its state until it enters a new epoch, at which point it becomes unsealed. Sealed servers may still handle read requests. Sealing allows the reconfiguration process to get a stable checkpoint of the system state and ensure all of the servers in the new configuration have consistent state before entering the new epoch.
Reconfiguration involves coordination between the configuration service, the old servers, and the new servers. Figure 5 shows the API for the configuration service, and Figure 6 shows the code for reconfiguration, invoked to change to a new set of servers specified by the newServers argument. Not shown is the monitoring logic that decides when to call this function or which new servers to choose; correctness (safety) is independent of that logic, and Grove does not prove liveness. Reconfiguration consists of the following steps:
5. Enable the primary in the new configuration, which allows the primary to start processing write requests (line 29).
In case of a network partition, it is possible that both sides of the partition will try to initiate reconfiguration. One might worry that this would lead to two copies of the system with diverging states. This possibility is ruled out with the help of the configuration service, which accepts only the highest-numbered new epoch in its WriteConfig RPC handler, together with the replicas' SetNewEpochState handler, which rejects state from lower-numbered epochs. As a result, the reconfiguration process that has the higher new epoch from GetNewEpochAndConfig() will win.
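The epoch check that lets only one of two racing reconfigurations win can be sketched as follows. This is an illustration under assumptions, not the configuration service API from Figure 5: the configService type, its fields, and the lower-case method names and signatures are hypothetical, and lease handling and replication of the service itself are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// configService is a hypothetical, single-node model of the configuration
// service: it hands out fresh epoch numbers and stores one configuration.
type configService struct {
	mu       sync.Mutex
	reserved uint64   // highest epoch number handed out so far
	epoch    uint64   // epoch of the most recently written configuration
	servers  []string // most recently written configuration
}

// reserveEpochAndGetConfig reserves the next epoch number and returns it
// together with a copy of the most recent configuration.
func (c *configService) reserveEpochAndGetConfig() (uint64, []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.reserved++
	return c.reserved, append([]string(nil), c.servers...)
}

// writeConfig installs servers as the configuration for epoch, but only if
// no higher-numbered epoch has been reserved in the meantime.
func (c *configService) writeConfig(epoch uint64, servers []string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if epoch < c.reserved {
		return false // a higher-numbered reconfiguration has started
	}
	c.epoch = epoch
	c.servers = append([]string(nil), servers...)
	return true
}

func main() {
	c := &configService{servers: []string{"s1", "s2"}}
	e1, _ := c.reserveEpochAndGetConfig() // reconfiguration on one side
	e2, _ := c.reserveEpochAndGetConfig() // racing reconfiguration
	fmt.Println(c.writeConfig(e1, []string{"s1", "s3"}))
	fmt.Println(c.writeConfig(e2, []string{"s2", "s4"}))
}
```

In this sketch the reconfiguration holding the lower epoch is rejected, so at most one of the two racing attempts installs its configuration.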

replica server: lease-based reads
Any replica server (primary as well as backup) can serve linearizable reads without communicating with other servers by using leases, as shown in Figure 7. Leases avoid the possibility of one server returning stale reads if reconfiguration happens and the new configuration has executed additional writes not seen by this server. Specifically, every replica runs a background thread that contacts the configuration service to obtain or extend a lease that promises the configuration service will not change the current epoch number (and thus not reconfigure) until lease expiration (e.g., 1 second from the time the lease is issued). All servers can serve read requests because this lease is a promise about the epoch number, rather than anything specific to a particular server's state.
When a replica server receives a read-only operation, and its lease is still valid, it computes the response from its local state. The replica's local state includes all committed operations since committed operations must be acknowledged by all servers. However, the state may also include ongoing write operations that have not yet been committed. To ensure that the client's observed read does not roll back due to a crash or reconfiguration, the replica waits for all the previous writes that the read depends on to be committed before sending the result to the client. As part of executing the read operation, LocalRead's job is to determine which prior requests the read depends on, returning the appropriate idx value as shown in Figure 7; it is always safe to return idx = s.nextIndex. If reconfiguration happens during waitForCommitted, the server tells the client to retry.
Since the clocks on different nodes might be slightly out of sync with each other, Grove provides a TrueTime-like API [13] for accessing the current time, GetTimeRange(). This function returns a pair of timestamps, earliest and latest, which provide lower and upper bounds for the current time.
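One way to use such an interval conservatively is sketched below. This is an assumption about how the two bounds would naturally be used, not the code in Figure 7: the timeRange type and the leaseValid and leaseExpired helpers are hypothetical names, and timestamps are plain integers in arbitrary units.

```go
package main

import "fmt"

// timeRange is a hypothetical representation of the pair returned by a
// TrueTime-like call: the true current time lies in [earliest, latest].
type timeRange struct{ earliest, latest uint64 }

// leaseValid is the lease holder's check: the lease can be relied on only
// if even the latest possible current time is before the expiration.
func leaseValid(now timeRange, expiration uint64) bool {
	return now.latest < expiration
}

// leaseExpired is the lease granter's check: the lease can be treated as
// over only if even the earliest possible current time is past expiration.
func leaseExpired(now timeRange, expiration uint64) bool {
	return now.earliest > expiration
}

func main() {
	now := timeRange{earliest: 90, latest: 110}
	// With expiration 100 the interval straddles the deadline, so neither
	// side may act: the holder stops reading, the granter keeps waiting.
	fmt.Println(leaseValid(now, 100), leaseExpired(now, 100))
}
```

The two checks are deliberately asymmetric: when the interval straddles the expiration time, both return false, so clock uncertainty costs availability rather than safety.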

storage library for replicas
Replica servers manage durable state with a storage library that provides a "state logger" to durably log new operations in an append-only file. To get good performance, the state logger buffers appends in memory while a background thread asynchronously appends and syncs the buffer to the file. The library provides a Wait() function that allows waiting until a prefix of the file has been made durable. The replica library uses Wait() to ensure changes are durably stored before replying to an RPC (not shown in our simplified code examples).
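The buffering and Wait() pattern can be sketched with a mutex and a condition variable. This is a toy model under assumptions, not the storage library itself: the stateLogger fields, Append, and flushLoop are hypothetical, and an in-memory slice stands in for the append-only file (a real implementation would write and sync the file where the comment indicates).

```go
package main

import (
	"fmt"
	"sync"
)

// stateLogger is a hypothetical model of a buffered append-only log.
type stateLogger struct {
	mu       sync.Mutex
	cond     *sync.Cond
	buf      []string // appended but not yet durable
	appended uint64   // number of entries appended so far
	durable  uint64   // number of entries known to be durable
	file     []string // stand-in for the on-disk file
}

func newStateLogger() *stateLogger {
	l := &stateLogger{}
	l.cond = sync.NewCond(&l.mu)
	go l.flushLoop() // background thread that makes entries durable
	return l
}

// Append buffers entry in memory and returns the log length after it.
func (l *stateLogger) Append(entry string) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.buf = append(l.buf, entry)
	l.appended++
	l.cond.Broadcast() // wake the flusher
	return l.appended
}

// Wait blocks until the first n entries have been made durable.
func (l *stateLogger) Wait(n uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.durable < n {
		l.cond.Wait()
	}
}

// flushLoop repeatedly takes the whole buffer and "writes" it as one batch.
func (l *stateLogger) flushLoop() {
	l.mu.Lock()
	for {
		for len(l.buf) == 0 {
			l.cond.Wait()
		}
		batch := l.buf
		l.buf = nil
		l.mu.Unlock()
		// A real logger would append batch to the file and sync it here,
		// without holding the lock, so that Append is not blocked on I/O.
		l.mu.Lock()
		l.file = append(l.file, batch...)
		l.durable += uint64(len(batch))
		l.cond.Broadcast() // wake any Wait() callers
	}
}

func main() {
	l := newStateLogger()
	n := l.Append("op1")
	l.Wait(n) // reply to the RPC only after this returns
	fmt.Println("durable prefix length:", n)
}
```

Taking the whole buffer at once is what gives batching: appends that arrive while one batch is being written are flushed together in the next one.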

Fault-tolerant configservice using paxos
To handle server failures for the configuration service itself, configservice relies on a simple Paxos-based replication library called paxos. paxos operates on a fixed set of servers (new servers cannot be added at runtime), but requires only a majority of the servers to process requests. paxos uses a leader to coordinate operations, but allows changing the leader if the previous one seems to have crashed.
The code structure of paxos largely follows the primary-backup replication library, with a few key differences. First, instead of relying on an external config service to choose new epoch numbers, paxos chooses new epoch numbers on its own, which it can do safely because the set of servers does not change. Second, instead of requiring every server to commit an operation, paxos requires only a majority, which also means that a new leader must obtain the latest state from a majority of other servers, rather than just one. Third, paxos is much simpler than the primary-backup replication library: it does not use leases, and it sends and writes the entire state to disk on every update rather than appending operations to a log. This means that paxos has lower performance for write operations, which is acceptable for the configuration service. Finally, paxos provides fast but weakly consistent reads.
The interface provided by paxos is shown in Figure 8. WeakRead returns the entire current state replicated by paxos, as stored on the server where WeakRead is invoked. The resulting state might be stale (if other servers have committed new operations in the meantime) or the resulting state might not even be committed (if this state was never acknowledged by a majority of servers). configservice implements GetConfig using WeakRead, despite its weak semantics, because the caller of GetConfig is the vRSM clerk, which can handle stale or even incorrect results, and because using WeakRead ensures that GetConfig is fast.
To execute write operations, such as ReserveEpochAndGetConfig as shown in Figure 9, configservice uses the Begin method (which should be invoked on the leader, otherwise the operation will be unable to commit). Begin returns the current replicated state from the local server, as well as a callback function commit that configservice will use to try to commit its new state. The commit callback is the core of paxos: it actually talks to other servers, unlike WeakRead and Begin. It tries to replicate the new state to a majority of servers, while checking that no other operations have been committed in the meantime. commit could succeed in getting a quorum of servers to accept this new state, or it might fail because another leader has been chosen. Finally, if the current leader appears to have crashed, the operator could call TryBecomingLeader to choose another server as the leader; it is always safe to call TryBecomingLeader.

Versioned state machine API
To build an application on top of vRSM, the developer must implement the versioned state machine interface shown in Figure 10. Apply() executes an application-level read-write operation and Read() executes an application-level read-only operation against the current in-memory state, while SetState() and GetState() allow serializing the in-memory state. Using this developer-provided interface, the vRSM library takes care of checkpointing the state on disk and copying the state to new replicas. For example, the call to GetStateAndSeal() in Figure 6 ultimately uses the developer-provided GetState() method to checkpoint the current state, and the call to SetNewEpochState() in Figure 6 ultimately calls the developer-provided SetState() method to initialize the replica's local state.
As described earlier, one complication is that read operations may observe writes that have not been replicated to all servers yet and thus have to wait for those operations to be committed. As an optimization, the application-provided Read() method can specify which writes the read result depends on, by returning the idx of the most recent write dependency as the first part of its return value. This allows reads to return quickly, without waiting for commits of recent writes, if the result does not depend on a recent write. This propagates into the idx value returned by s.stateLogger.LocalRead() in Figure 7 and determines what committed operations the replica library waits for.

vRSM clerk
vRSM provides a client clerk library, which hides the complexity of issuing requests to vRSM over the network. The clerk API is shown in Figure 11. Applications can use the clerk library on many client machines to access vRSM. To execute an operation, the clerk needs to know the addresses of the replica servers. The clerk initially obtains this information from the configuration service and caches it locally. Calling clerk.Apply(op) issues a read-write operation to the primary while clerk.Read(op) issues a read-only operation to any replica. If a server indicates that it is no longer the primary (or replica, for read-only operations), the clerk asks the config server for the new server information and retries. Because of retries, it is possible for a single clerk.Apply() call to result in an operation being applied more than once. A higher-level library handles deduplicating operations (§2.2.8).

exactlyonce library
The exactlyonce library helps applications using vRSM ensure that operations execute exactly once. It consists of a new clerk that wraps the vRSM clerk (which potentially duplicates operations through retries) and a state machine transformer that adds duplicate detection and handling to an application-level state machine. The clerk API is the same as the underlying vRSM clerk, but additionally guarantees that operations are not applied more than once. To achieve this, the clerk adds a unique request ID made up of a client ID and sequence number to each operation. On the other side, the exactlyonce state machine transformer augments an application-level state machine with a reply table to keep track of previously applied requests along with their replies. Upon getting a new request, the exactlyonce library calls the application state machine's Apply function and stores the reply in the reply table. Upon getting a duplicate, the library does not call into the application state machine and instead returns the previous reply. For read-only operations, exactlyonce ignores the reply table and calls Read on the application state machine.
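The reply-table idea can be sketched as a wrapper around an application state machine. This is a minimal illustration, not the exactlyonce library's code: the requestID, stateMachine, appendLog, and exactlyOnce names are hypothetical, the table here keeps every reply forever, and serialization of the table as part of the replicated state is omitted.

```go
package main

import "fmt"

// requestID is a hypothetical unique ID: a client ID plus sequence number.
type requestID struct{ clientID, seq uint64 }

// stateMachine is a simplified application-level interface (an assumption;
// the real versioned interface has more methods).
type stateMachine interface {
	Apply(op string) string
}

// appendLog is a toy application state machine that records operations.
type appendLog struct{ ops []string }

func (a *appendLog) Apply(op string) string {
	a.ops = append(a.ops, op)
	return fmt.Sprintf("applied #%d", len(a.ops))
}

// exactlyOnce wraps a state machine with a reply table for deduplication.
type exactlyOnce struct {
	inner   stateMachine
	replies map[requestID]string
}

// Apply runs op on the inner state machine only the first time id is seen;
// duplicates get the stored reply without touching the application state.
func (e *exactlyOnce) Apply(id requestID, op string) string {
	if reply, ok := e.replies[id]; ok {
		return reply
	}
	reply := e.inner.Apply(op)
	e.replies[id] = reply
	return reply
}

func main() {
	app := &appendLog{}
	sm := &exactlyOnce{inner: app, replies: map[requestID]string{}}
	id := requestID{clientID: 7, seq: 1}
	first := sm.Apply(id, "deposit 10")
	retry := sm.Apply(id, "deposit 10") // duplicate caused by a clerk retry
	fmt.Println(first == retry, len(app.ops))
}
```

Because the reply table lives inside the replicated state machine in this design, a duplicate is detected no matter which replica ends up applying it.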

Applications on top of vRSM

vKV
vKV is implemented on top of vRSM and the exactlyonce library. The server-side part of vKV is an implementation of the state machine interface expected by vRSM. The client-side part of vKV is a clerk implemented on top of the exactlyonce clerk, with the API shown in Figure 12. By building on top of vRSM, the implementation of vKV itself is simple: it consists of (de)serialization methods to turn key-value operations into byte slices and a few functions to read and update an in-memory map. In addition to storing a map of keys to values, vKV also stores a map from keys to the index of the last operation that modified that key, which allows vKV to take advantage of vRSM's versioned state machine interface (§2.2.6) to improve the performance of reads.
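The two-map layout can be sketched as follows. This is an illustration under assumptions, not vKV's implementation: the kvState type and the applyPut and read methods are hypothetical, operations are numbered from 1 in application order, and serialization is omitted.

```go
package main

import "fmt"

// kvState is a hypothetical in-memory state for a versioned key-value
// state machine: values plus, per key, the index of its last write.
type kvState struct {
	kvs          map[string]string
	lastModified map[string]uint64
	nextIndex    uint64 // number of write operations applied so far
}

func newKVState() *kvState {
	return &kvState{
		kvs:          map[string]string{},
		lastModified: map[string]uint64{},
	}
}

// applyPut applies a write and records which operation last touched key.
func (s *kvState) applyPut(key, value string) {
	s.nextIndex++
	s.kvs[key] = value
	s.lastModified[key] = s.nextIndex
}

// read returns the value for key together with the index of the most recent
// write it depends on; the replica only needs to wait for that index to be
// committed, not for every write applied so far.
func (s *kvState) read(key string) (uint64, string) {
	return s.lastModified[key], s.kvs[key]
}

func main() {
	s := newKVState()
	s.applyPut("a", "1")
	s.applyPut("b", "2") // a later write to an unrelated key
	idx, v := s.read("a")
	fmt.Println(idx, v, s.nextIndex)
}
```

In the example, a read of "a" depends only on the first write, so it would not have to wait for the later write to "b" to commit.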

Lease-based client-side caching
As another example of using leases, cachekv is a lease-based client-side caching library that works by storing both data and lease expiration times in vKV. Figure 13 shows the GetAndCache function, which returns the value of the specified key and caches it internally for cachetime time. It uses CondPut to atomically increase the lease duration, which ensures that a concurrent modification did not change the value since the Get on line 4. Similarly, CacheKv's Put function (not shown) uses CondPut to ensure the value is only changed if the lease is expired. Finally, CacheKv's Get function first tries reading from k.cache and only invokes the Get on vKV if the value is not cached. The client-caching library is simple, but exemplifies how leases can be used for cache consistency [19].
[...] corresponding to one key-value pair. The lock service provides a specification for its Acquire() and Release() methods that allows applications to implement exclusive locking, such as for accounts in our bank example, in the style of a traditional concurrent separation logic lock specification [39]. This specification is quite different from the specifications of the underlying vKV methods like CondPut(), and makes it easy for the bank example to keep separation logic resources protected by locks that the lock service provides.

Bank transactions
As a top-level application, we implemented a toy bank application, which uses transactions built on top of the vKV clerk and lock service interfaces, and does not depend on the details of how those interfaces are implemented. However, the specifications for the vKV clerk and lock service are strong enough to prove correctness for the bank's transactions.
The bank uses an instance of vKV to store its account state, with one key-value pair used to store the balance of one account. The bank uses the lock service (with its own separate instance of vKV) to handle concurrent access to accounts. Every Transfer(src, dst, amt) operation obtains two locks, on the src and dst accounts (sorted to avoid deadlock) before accessing their respective account balances in vKV. This ensures that concurrent transfers are safe to execute, and allows for concurrency when transfers access different accounts. The Audit() function grabs locks for all accounts, computes the total balance by retrieving each account's balance, and then releases all of the locks.
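The locking discipline can be sketched with local mutexes standing in for the distributed lock service and an in-memory balance standing in for vKV. The bank and account types and the lower-case transfer and audit methods are hypothetical, every named account is assumed to exist, and acquiring audit locks in sorted order is a choice made in this sketch.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// account pairs a lock (stand-in for the lock service) with a balance
// (stand-in for a key-value pair in vKV).
type account struct {
	mu      sync.Mutex
	balance uint64
}

// bank maps account names to accounts; the map itself is never modified
// after construction, so it can be read concurrently.
type bank struct{ accounts map[string]*account }

// transfer moves amt from src to dst, locking both accounts in sorted name
// order so that two concurrent transfers cannot deadlock.
func (b *bank) transfer(src, dst string, amt uint64) bool {
	if src == dst {
		return false
	}
	first, second := src, dst
	if second < first {
		first, second = second, first
	}
	b.accounts[first].mu.Lock()
	defer b.accounts[first].mu.Unlock()
	b.accounts[second].mu.Lock()
	defer b.accounts[second].mu.Unlock()

	if b.accounts[src].balance < amt {
		return false // insufficient funds
	}
	b.accounts[src].balance -= amt
	b.accounts[dst].balance += amt
	return true
}

// audit locks every account (in sorted order), sums the balances, and then
// releases all locks, so it observes no transfer halfway through.
func (b *bank) audit() uint64 {
	names := make([]string, 0, len(b.accounts))
	for name := range b.accounts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		b.accounts[name].mu.Lock()
	}
	var total uint64
	for _, name := range names {
		total += b.accounts[name].balance
	}
	for _, name := range names {
		b.accounts[name].mu.Unlock()
	}
	return total
}

func main() {
	b := &bank{accounts: map[string]*account{
		"alice": &account{balance: 100},
		"bob":   &account{balance: 100},
	}}
	b.transfer("bob", "alice", 30)
	fmt.Println(b.accounts["alice"].balance, b.accounts["bob"].balance, b.audit())
}
```

The invariant an audit checks in this sketch is that the total balance is unchanged by transfers, which holds because each transfer updates both balances while holding both locks.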
If one of the bank nodes crashes, the locks held by any threads on that node in the lock service will remain locked. Recovering from this would require some form of undo or redo logging; for instance, the bank threads could send undo log entries to the lock service. We have not implemented this in the bank prototype.

Grove
To formally verify distributed systems such as the case studies described in the previous section, Grove adopts the ideas of concurrent separation logic (CSL) [26,39]. CSL enables modular specifications and proofs: a developer can take two verified components, each with their own specification, and use both of them in their application without worrying that the combination breaks either component's proof. In the context of distributed systems, this allows Grove developers to separately specify and verify different services that run on different machines but that will eventually be used together (e.g., a configuration service, a key-value store, and a lock service), as well as different libraries that will run on the same machine (e.g., a clerk that talks to vKV, a clerk that talks to the lock service, etc.).
In the rest of this section, we first introduce Grove's execution model (§3.1), followed by how Grove generalizes separation logic and resource ownership to distributed systems (§3.2), and Grove's library of reasoning principles.

Execution model
Grove models distributed systems as a collection of nodes, each running a multithreaded program written in Go. Each node has its own memory heap (accessed in Go using loads and stores) as well as durable storage (accessed by reading from and writing to files using read() and write()).
Crash recovery. Each node has a main() function that runs when the node starts up for the first time as well as when the node restarts after a crash. When a node crashes, it loses the contents of its memory heap and restarts with an empty heap, but retains its durable state.
Nodes crash independently of one another. A few nodes might crash while others keep running, or all of the nodes might crash at the same time. Crashes can happen at any point, including when a node is still recovering from an earlier crash. For instance, a node's main() function might have a recovery phase during which it loads durable state into memory or communicates with other nodes to restore its state; crashes can occur even during this phase.
Unreliable network. Nodes communicate over an unreliable network. The low-level network API has a notion of a Connection, resembling a connected UDP socket. The API provides two functions, conn.Send(msg) and conn.Receive(), that respectively send and receive messages over that connection. Grove models the network as unreliable: conn.Send(msg) is not guaranteed to deliver messages in order, and messages may be dropped or duplicated.
Clocks. There is a global clock, which advances monotonically and represents a notion of wall-clock time. Every node exposes a TrueTime-like API [13], GetTimeRange(), which returns a pair of timestamps that represent an interval (lower and upper bounds) that, according to our model, must contain the global clock value. This assumes that node clocks are synchronized to within known bounds (on the order of less than a second, for the purposes of vKV's use of leases).

Separation logic for distributed systems
Grove generalizes concurrent separation logic (CSL) [26,39] to reason about distributed systems. CSL uses Hoare-logic-style specifications for pieces of code (e.g., functions) of the form {P} f() {Q}, meaning the precondition for running f() is the assertion P and the postcondition is Q. To prove such a spec, a developer applies proof rules to reason line-by-line about f(), starting with a state matching P, and showing that the final state matches Q.
This section reviews the background on CSL and introduces key abstractions that Grove provides on top of CSL, along with how they are used in vKV's proof.

Ownership reasoning
In separation logic, assertions not only describe what is true about a system's state, but also what parts of the state are logically owned by the thread executing the code at that point. For example, the assertion x → v (pronounced "x points to v") says that memory location x stores a value v and that the thread running that function owns the location x, in the sense that, as long as this thread continues to own this assertion, no other thread can access location x. Such ownership constraints form the basis for modular reasoning: for instance, the fact that no other thread can access location x allows a developer to reason about this function without considering other concurrently executing code.
Grove's library brings this ownership-based modular reasoning to distributed systems. Grove provides per-node heap points-to resources: x →_j v denotes ownership of location x with value v on node j's heap. Grove also provides resources for network state and file contents, as we discuss later. In a distributed systems setting, this enables the developer to verify the code running on one node without worrying about what code might be running on other nodes at the same time.
Separation logic additionally introduces a new logical connective, *, called separating conjunction. The assertion P * Q holds in a state s if both P and Q are true in s, and furthermore, s can be split into two disjoint resources satisfying P and Q, respectively. In conventional CSL, disjointness means separate subsets of a program's memory heap. Grove uses the separating conjunction to account for separation across different nodes as well.

Ghost resources
In addition to physical resources like the heap, separation logic allows proofs to use ghost resources, a modern form of auxiliary variables [24,29]. Ghost resources talk about the state of the system at a more abstract level. In concurrent separation logic, ghost resources represent ghost state: state that is not materialized by the actual running code, but is useful for specification and proof. Just like physical resources, ghost resources can be owned. While the evolution of physical resources is entirely determined by the code (e.g., based on how the code modifies memory or file contents), ghost resources are controlled by the proof.
Ghost resources are especially useful for reasoning about distributed systems because they can span nodes and allow developers to reason about the system at a higher level of abstraction. Ghost resources are more powerful than regular abstract state, because these resources can be owned, which in turn provides constraints on how different threads can modify the ghost state, and thereby enables modular reasoning.
Epochs. Using ghost resources, Grove provides an epoch abstraction, which is used in the proof of vKV to keep track of and reason about the current configuration. Grove provides two resources for representing epochs. The first, CurrentEpoch → e, states that the current epoch number is exactly e. This resource is owned by the configuration service: it is the only component that can approve a reconfiguration. The second, CurrentEpoch ≥ e, states that the current epoch is at least e. This resource is duplicable, meaning that many threads can have it at the same time. In a way, this resource represents knowledge of the fact that the epoch is at least e, rather than any exclusive ownership of some part of the state. The fact that CurrentEpoch ≥ e is duplicable implies that epoch numbers are monotonically increasing (i.e., the resource promises that the current epoch number cannot decrease). Many vKV components, including the primary and the backup replicas, make use of this resource to represent knowledge that a new epoch exists. This means that a server can reject operations from earlier epochs, such as a stale SetNewEpochState().
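The stale-request check that the lower-bound resource justifies can be sketched in Go. This is a minimal illustration and not vKV's code: the `Replica` type, its field, and `ErrStaleEpoch` are hypothetical names; only `SetNewEpochState` is taken from the text, and its signature here is assumed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleEpoch is a hypothetical error for requests from an old epoch.
var ErrStaleEpoch = errors.New("stale epoch")

// Replica is a hypothetical stand-in for a replica server.
type Replica struct {
	mu    sync.Mutex
	epoch uint64 // highest epoch this replica has seen; never decreases
}

// SetNewEpochState moves the replica to epoch e and rejects any epoch
// lower than one it already knows about.
func (r *Replica) SetNewEpochState(e uint64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e < r.epoch {
		return ErrStaleEpoch
	}
	r.epoch = e
	return nil
}

func main() {
	r := &Replica{}
	fmt.Println(r.SetNewEpochState(3)) // <nil>
	fmt.Println(r.SetNewEpochState(2)) // stale epoch
}
```

The stored epoch plays the role of the replica's knowledge CurrentEpoch ≥ e: because it only grows, a rejected request can never become valid later.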
Logs. Grove provides a log abstraction using ghost resources, encoded as an append-only list. The proof of vKV encodes the main logical state of each replica server using this log abstraction, representing the operations that the node has applied so far. There are three kinds of ghost resources provided by Grove that talk about the state of an append-only log.
The points-to resource a list → ℓ denotes ownership of an append-only list named a with current value ℓ. The only way to update this points-to resource in the proof is to go from ownership of a list → ℓ to a list → ℓ + ℓ′, i.e., to append at the end.
The lower bound resource a list ⊒ ℓ denotes knowledge that the list a has prefix ℓ. This is similar to the lower-bound epoch resource CurrentEpoch ≥ e described above, and just like it, a list ⊒ ℓ is duplicable. Other parts of the proof cannot possibly violate lower bound resources.
Finally, the read-only resource a list → □ ℓ denotes knowledge that a has value ℓ and can never be updated. To establish this read-only resource, a proof has to give up ownership of a list → ℓ and in exchange get ownership of the read-only resource. After this, no part of the proof can possibly have ownership of a list → ℓ, so the list can never be updated again.
vKV's proof represents operations accepted by each server as of epoch number e with append-only lists: server j ∈ {0, 1, ..., n} owns the resource accepted j [e] list → ℓ. Server j can only gain knowledge (not ownership) about other servers' accepted k [e] list and does this through RPCs (discussed in §3.3). All servers also own heap resources for their in-memory representation of this abstract state, but the proof does not involve sharing these heap resources across nodes. Servers only talk about other servers in terms of ghost resources for their accepted j [e] list. Read-only resources are used to represent sealed replicas.
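As a rough runtime analogue of the three log resources, the following Go sketch models an append-only log with prefix snapshots and sealing. Grove's resources are ghost state that exists only in the proof, so this is an illustration of the rules rather than anything Grove provides; `AppendOnlyLog`, `Prefix`, `HasPrefix`, `Seal`, and `ErrSealed` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSealed is returned when appending to a sealed log.
var ErrSealed = errors.New("log is sealed")

// AppendOnlyLog mimics the points-to resource: the owner may only append.
type AppendOnlyLog struct {
	ops    []string
	sealed bool
}

// Append adds ops at the end, the only update the log permits.
func (l *AppendOnlyLog) Append(ops ...string) error {
	if l.sealed {
		return ErrSealed
	}
	l.ops = append(l.ops, ops...)
	return nil
}

// Prefix returns a copy of the current contents. Like a lower-bound
// resource, the copy remains a prefix of the log after later appends.
func (l *AppendOnlyLog) Prefix() []string {
	return append([]string(nil), l.ops...)
}

// HasPrefix reports whether p is a prefix of the log.
func (l *AppendOnlyLog) HasPrefix(p []string) bool {
	if len(p) > len(l.ops) {
		return false
	}
	for i := range p {
		if p[i] != l.ops[i] {
			return false
		}
	}
	return true
}

// Seal makes the log read-only and returns its final contents.
// Calling it again returns the same contents.
func (l *AppendOnlyLog) Seal() []string {
	l.sealed = true
	return l.Prefix()
}

func main() {
	var l AppendOnlyLog
	l.Append("put a 1")
	lower := l.Prefix()
	l.Append("put b 2")
	fmt.Println(l.HasPrefix(lower)) // true
	l.Seal()
	fmt.Println(l.Append("put c 3")) // log is sealed
}
```

A snapshot taken with `Prefix` can be copied freely, which mirrors why lower-bound resources are duplicable while the log itself has a single owner.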
The proof also has a global points-to committed list → ℓ that represents the committed list of operations. When a primary server commits an operation, the proof updates the committed points-to resource. However, when reconfiguration happens, the new primary server will need to do the same. Thus, the committed points-to resource cannot be permanently owned by any one node. To share this resource between nodes, the proof uses a separation logic invariant, which we explain next.

Invariants
Concurrent separation logic allows for resources to be shared through invariants, which can talk about the resources relevant to only a small part of the system without needing to know about the entire system's state. These invariants maintain ownership of resources that must always be available. The assertion $\boxed{P}$ denotes an invariant that maintains ownership of P. When reasoning about code, proofs can temporarily "open" $\boxed{P}$ to get ownership of P, but are required to return ownership of P within one physically atomic step of the code in order to "close" the invariant. An invariant is created by starting with ownership of P and giving it up to establish $\boxed{P}$. The proposition $\boxed{P}$ asserts knowledge of the invariant, as opposed to direct ownership of the resources P; many threads can hold $\boxed{P}$ at the same time.
Multiple invariants that each talk about separate parts of a larger system can be freely combined. As an example, a per-node invariant can describe how resources are shared between the node's threads; this is how invariants are used in traditional single-machine CSL [26,39]. At the same time, a separate invariant can connect the logical state of all replicas to ensure that the replicas agree on the log of accepted operations. Finally, yet another invariant can cover the configuration service and how the reconfiguration logic ensures that only one set of servers is active at a time.
Example. In the vKV proof, each node has a local invariant I nodej that maintains ownership of local heap resources and of accepted j [e] list → ℓ to help reason about node-local concurrency, using Grove's log ghost resource.
Separately, to reason about how nodes coordinate with each other, the proof has a "replication invariant" I rep, defined as the invariant whose body is ∃ℓ, ∃e, committed list → ℓ * (accepted 0 [e] list ⊒ ℓ * ... * accepted n [e] list ⊒ ℓ). Here, * is the "separating conjunction" operator that combines ownership of multiple disjoint resources. The invariant maintains ownership of the committed list of operations and, for every replica server, knowledge that the server has accepted all the committed operations. This invariant encodes the primary/backup protocol: in order for an operation to be committed, all the servers must have accepted it. Not shown is a part of this invariant that says epoch e corresponds to a configuration consisting of servers 0 through n.
When the primary commits an operation op, the proof opens the replication invariant to update the committed points-to from ℓ to ℓ + [op]. In order to close the invariant after the update, the primary needs knowledge of the lower-bound resources accepted j [e] list ⊒ ℓ + [op] from all servers j. To get these lower bound resources from the backups to the primary, the proof uses Grove's RPC reasoning principles.

Reasoning about RPCs
Building on Grove's network model, the verified Go RPC library rpc provides reasoning principles that allow developers to reason about RPCs much like how they would reason about local function calls in separation logic. Key to this RPC specification is its use of duplicable assertions. Formally, an assertion P is called duplicable if P implies P * P, meaning that it is possible to create a copy of any resource in P. For instance, a list → ℓ is not duplicable because one thread's ownership of this resource precludes any other thread from owning it. On the other hand, knowledge such as a list ⊒ ℓ is duplicable. The notion of duplicability allows stating Grove's RPC specification: for any function f with specification {P} f {Q}, the specification for invoking f through an RPC is {P} rpcClient.Call("f") {Q}, as long as P is duplicable. Duplicability of P is crucial because the RPC library may retransmit its request multiple times before it receives a response, and each execution of f will consume one instance of P. Note that the specification does not, strictly speaking, require Q to be duplicable, because Grove obtains a fresh copy of Q from each invocation of f() on the server.
Example. As an example, consider the ApplyAsBackup RPC in vKV, issued by the primary to backup servers when replicating a new operation. The postcondition of ApplyAsBackupRPC(e, index, op) to server j is the assertion accepted j [e] list ⊒ ℓ + [op]. This represents a promise that server j accepted op in its log. As the primary collects more of these lower-bound resources, it will eventually have enough to commit the operation using the replication invariant I rep.
When it comes to choosing the precondition of ApplyAsBackupRPC(e, index, op), one might naively pick "ownership of resources to apply operation op once." But such a precondition is not duplicable, as required by RPCs. A recurring pattern when specifying RPCs with Grove is rephrasing such preconditions to not involve any exclusive ownership of resources, but instead talk about knowledge. The (correct) precondition for the ApplyAsBackup RPC is "knowledge that op is the operation at position index." The full spec (ignoring errors) for an ApplyAsBackupRPC is:
{knowledge that op is operation at index} server j .ApplyAsBackupRPC(e, index, op) {accepted j [e] list ⊒ ℓ + [op]}
This precondition allows the primary to retry RPCs to ensure that every backup has learned about the operation.
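The effect of a knowledge-only precondition can be illustrated with a retry-safe handler. The Go sketch below is an assumption-laden model, not vKV's implementation: the `Backup` type and the two error values are hypothetical, and the epoch argument e is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical errors for this sketch.
var (
	ErrMismatch = errors.New("a different operation is already at this index")
	ErrGap      = errors.New("index is beyond the end of the log")
)

// Backup is a hypothetical backup replica with an in-memory log.
type Backup struct {
	mu  sync.Mutex
	log []string
}

// ApplyAsBackup records that op is the operation at position index.
// The request carries only a fact, so a retransmitted copy is harmless:
// the second call finds op already in place and acknowledges again.
func (b *Backup) ApplyAsBackup(index uint64, op string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := uint64(len(b.log))
	switch {
	case index < n: // already applied: verify and acknowledge again
		if b.log[index] != op {
			return ErrMismatch
		}
		return nil
	case index == n: // next free slot: append
		b.log = append(b.log, op)
		return nil
	default: // an earlier operation is missing
		return ErrGap
	}
}

func main() {
	b := &Backup{}
	fmt.Println(b.ApplyAsBackup(0, "put a 1")) // <nil>
	fmt.Println(b.ApplyAsBackup(0, "put a 1")) // <nil>
	fmt.Println(len(b.log))                    // 1
}
```

Had the request instead meant "apply op once more", the second delivery would have appended a duplicate entry, which is the runtime counterpart of a non-duplicable precondition.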

Reasoning about leases
To reason about leases, the Grove library provides the notion of a time-bounded invariant. The invariant contains some resources R representing what the lease L promises to maintain until its expiration, denoted by $\boxed{R}^{L}$, and a separate resource representing the expiration time exp of the lease, denoted by L expires → exp. L is a logical identifier for the lease, and does not show up in execution.
As an example, vKV uses a lease to promise that the configuration epoch number will remain the same, which ensures that no reconfiguration will take place for the duration of the lease (which in turn allows replicas to handle read-only requests on their own). When the configuration service hands out such a lease, it creates a time-bounded invariant $\boxed{\text{CurrentEpoch} \to e}^{L}$, along with a resource indicating when the lease expires, L expires → exp. It then gives out the invariant and a duplicable version of the lease expiration resource, L expires ≥ exp; having a duplicable lower bound on the expiration time, as opposed to the exact expiration time, simplifies lease renewal.
There are four rules for time-bounded invariants. First, a time-bounded invariant can be created by giving up ownership of some resource R, and specifying a time at which it will expire. For instance, vKV's configuration service does this when issuing a lease in response to a GetLease() RPC, giving up its ownership of CurrentEpoch → e.
Next, if a time-bounded invariant expires, according to the L expires → exp resource, its resources can be reclaimed. vKV's configuration service does this as part of reconfiguration: TryWriteConfig() waits for lease expiration, and gets back ownership of CurrentEpoch → e, which it can then increment to CurrentEpoch → e + 1.
Third, a time-bounded invariant can be extended: in vKV, the configuration service owns L expires → exp if there is an existing lease, and if another GetLease() RPC arrives, the configuration service extends the lease by advancing the expiration time to L expires → exp + ∆ (and sends a duplicable lower bound L expires ≥ exp + ∆ to the caller).
Finally, the resources inside of the time-bounded invariant can be accessed by opening the invariant, as long as the time-bounded invariant has not expired. Opening a time-bounded invariant comes with the same obligations as opening a regular invariant: the proof gets ownership of the resources from the invariant, but must return them to close the invariant after at most one atomic step. In vKV, the CurrentEpoch → e resource inside the lease invariant is accessed by the primary-backup replication library (in contrast with the previous three operations, which all happen on the configuration service).
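The first three rules correspond to bookkeeping that the configuration service performs at run time. The Go sketch below models that bookkeeping under several assumptions: time is an abstract tick count passed in by the caller, a lease counts as unexpired strictly before its expiration tick, and an extension moves the expiration to now + delta without ever moving it backwards. `ConfigService` and its fields are hypothetical; `GetLease` and `TryWriteConfig` are named after the RPCs in the text, with assumed signatures.

```go
package main

import "fmt"

// ConfigService is a hypothetical model of the lease bookkeeping.
type ConfigService struct {
	epoch       uint64
	leaseExpiry uint64 // 0 means no lease has been issued yet
}

// GetLease creates or extends the lease on the current epoch. It returns
// the epoch and a lower bound on the lease's expiration time.
func (c *ConfigService) GetLease(now, delta uint64) (epoch, expiry uint64) {
	if now+delta > c.leaseExpiry {
		c.leaseExpiry = now + delta // create or extend; never shorten
	}
	return c.epoch, c.leaseExpiry
}

// TryWriteConfig advances the epoch only after the lease has expired.
// Until then the epoch is locked inside the lease and must not change.
func (c *ConfigService) TryWriteConfig(now uint64) bool {
	if now < c.leaseExpiry {
		return false // lease still valid: reconfiguration must wait
	}
	c.epoch++
	return true
}

func main() {
	c := &ConfigService{}
	epoch, expiry := c.GetLease(10, 5)
	fmt.Println(epoch, expiry)        // 0 15
	fmt.Println(c.TryWriteConfig(12)) // false
	fmt.Println(c.TryWriteConfig(15)) // true
}
```

Because a holder only ever learns a lower bound on the expiration, extending the lease never invalidates what earlier holders were told.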

Reasoning about clocks
Consider the lease expiration check in vKV shown in Figure 7. This pattern is tricky to reason about: by the time s.stateLogger.LocalApplyReadonly() runs, the lease may no longer be valid, if there was a long delay right after the if statement's check. As a result, whatever invariant the lease was protecting might no longer be true by the time the developer wants to use it in their proof.
To address this proof challenge, Grove's specification for GetTimeRange() allows the developer to perform arbitrary proof steps (such as opening and closing invariants and updating ghost resources) at the instant when GetTimeRange() executes. In the context of these proof steps, the developer also gets access to a CurrentTime → t resource which represents the current time, and a promise that the return value r of GetTimeRange() satisfies r.earliest ≤ t ≤ r.latest. (Grove implements this using logical atomicity [23].) One subtlety is that, at the instant that GetTimeRange() executes, the code has not yet executed the comparison checking if the lease is still valid (i.e., comparing to leaseExpiry). As a result, once the proof gets the CurrentTime → t resource, the developer must explicitly consider two cases at the instant of GetTimeRange(): either the subsequent check will succeed or it will fail. In the case where the subsequent check will succeed, the developer can use CurrentTime → t to then open the time-bounded invariant and access the CurrentEpoch → e resource inside it. (§4.3 has a more detailed discussion of how this allows proving linearizability for reads.) When the proof eventually considers the actual comparison in the if statement, only one of the if branches will be viable in each of the two proof cases.
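The check-then-read pattern under discussion can be sketched in Go as follows. This is not the code of Figure 7: the `Server` type, the tick-based `TimeRange`, and the helper `timeRangeAround` are hypothetical, the time range is passed in as an argument so the example is deterministic, and the wait for pending writes to commit is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// TimeRange bounds the true current time t: Earliest <= t <= Latest.
type TimeRange struct {
	Earliest, Latest uint64
}

// timeRangeAround is a hypothetical clock model: the local reading may be
// off by at most maxErr ticks in either direction.
func timeRangeAround(local, maxErr uint64) TimeRange {
	earliest := uint64(0)
	if local > maxErr {
		earliest = local - maxErr
	}
	return TimeRange{Earliest: earliest, Latest: local + maxErr}
}

// Server is a hypothetical replica that serves lease-based reads.
type Server struct {
	mu          sync.Mutex
	leaseExpiry uint64
	state       map[string]string
}

// ApplyReadonly serves a read from local state only if even the latest
// possible current time is before the lease expiration.
func (s *Server) ApplyReadonly(key string, now TimeRange) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if now.Latest >= s.leaseExpiry {
		return "", false // lease may have expired: caller must retry
	}
	// A delay here is harmless for local state: the lock is still held.
	return s.state[key], true
}

func main() {
	s := &Server{leaseExpiry: 100, state: map[string]string{"k": "v"}}
	v, ok := s.ApplyReadonly("k", timeRangeAround(90, 5))
	fmt.Printf("%q %v\n", v, ok) // "v" true
	_, ok = s.ApplyReadonly("k", timeRangeAround(96, 5))
	fmt.Println(ok) // false
}
```

Comparing against `Latest` rather than the raw local reading is what makes the check sound under clock error: if it passes, the true time at the moment the range was obtained was certainly before the expiration.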

Reasoning about exactly-once operations
vRSM's exactlyonce library deduplicates requests, by running the request the first time it is seen, and using the saved reply if the request appears again. (This is part of the state machine and hence the reply cache is duplicated across replicas.) This poses a proof challenge in situations when handling the request requires exclusive ownership of resources: on the one hand, the proof must be ready to provide these resources in case this is the first time the request is seen (and thus must be executed), but on the other hand, the proof cannot have a second copy of the resources if this is a duplicate of the original request (because the resources are exclusive and have already been used).
For instance, the top-level spec for vKV's Put is written in terms of an exclusive k → kv v that denotes ownership of a particular key k and that its value is v. A high level proof plan for this is to first transfer ownership of k → kv v to the primary server so that it can be certain that it is safe to Put on the key. However, Grove's RPC library can only send duplicable resources, because it may retransmit the request. Replication complicates ownership transfer even more: even if the client's Put somehow does transfer k → kv v to a primary server, that primary may crash and reconfiguration may set up a new primary. Then the client retries and, following the high-level proof plan of transferring resources to the server in charge of handling a request, the proof would need to somehow reclaim k → kv v from the old primary and then transfer it to the new primary.
To deal with such ownership transfer challenges, Grove allows proofs to use an escrow pattern [43]. The idea of the escrow pattern is to indirectly transfer ownership of some nonduplicable resources by "depositing" the resource in a "known location" (an escrow invariant), and then only transferring duplicable knowledge that the deposit has happened. For this to work, the other party needs to have the sole right to take things out of the escrow.
Example. At the beginning of a client's Put, the client owns k → kv v. To allow a server to access it, the client gives up ownership to establish the invariant $\boxed{k \to_{\text{kv}} v \lor \text{Tok}}$, called the escrow invariant. Here, Tok is an exclusive ghost token that is initially owned by the replica servers. Exclusive means that Tok * Tok → False. The full proof of the exactlyonce library deals with multiple requests and escrows by having a separate token for each request ID, all of which are initially owned by the replica servers.
When the client sends the Put operation to a primary server, it only transfers knowledge of the escrow invariant. Because knowledge is duplicable, the client can retry and also transfer knowledge of the invariant to future primary servers. On the other end, when a server receives a request it also gains knowledge of the invariant. If the request is fresh, then the server will own Tok. When the request is committed, the server opens the escrow invariant and has to deal with the two possible disjuncts in the invariant. In the left disjunct, the server now has ownership of k → kv v and can close the invariant by placing Tok inside of it. In the right disjunct, the proof can derive a contradiction because there would be two copies of the exclusive Tok: one from the invariant and one already owned by the server. Finally, if the request is found to be a duplicate when committed, then the server does not own Tok to start with, and cannot get ownership of k → kv v; instead, it replies to the client with the previously cached reply.
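The run-time behavior that the escrow proof justifies is request deduplication with a reply cache. The Go sketch below is a simplified single-node model, not the exactlyonce library itself: `ExactlyOnceKV`, its fields, and the use of a plain request ID are hypothetical.

```go
package main

import "fmt"

// ExactlyOnceKV is a hypothetical key-value state machine with a reply
// cache keyed by request ID.
type ExactlyOnceKV struct {
	kv       map[string]string
	replies  map[uint64]string // request ID -> saved reply
	Executed int               // number of Puts actually executed
}

func NewExactlyOnceKV() *ExactlyOnceKV {
	return &ExactlyOnceKV{
		kv:      make(map[string]string),
		replies: make(map[uint64]string),
	}
}

// Put executes the request the first time reqID is seen and returns the
// saved reply for any duplicate of it.
func (s *ExactlyOnceKV) Put(reqID uint64, key, value string) string {
	if reply, seen := s.replies[reqID]; seen {
		return reply // duplicate: do not touch the key again
	}
	s.kv[key] = value
	s.Executed++
	reply := "OK"
	s.replies[reqID] = reply
	return reply
}

// Get returns the current value of key.
func (s *ExactlyOnceKV) Get(key string) string { return s.kv[key] }

func main() {
	s := NewExactlyOnceKV()
	s.Put(1, "k", "a")
	s.Put(2, "k", "b")
	s.Put(1, "k", "a") // late retransmission of request 1
	fmt.Println(s.Get("k"), s.Executed) // b 2
}
```

The fresh branch corresponds to the server holding Tok and taking the key's ownership out of escrow; the duplicate branch corresponds to the server holding no token and therefore being unable to modify the key.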

Linearizing read operations
When a client retransmits a read operation to vRSM, the server re-executes it instead of using the exactlyonce library to look up the previous response. This improves performance because it avoids the cost of logging the read operation to disk. Re-executing reads is safe in terms of the server state: since the read has no side effects, it is safe to run any number of reads. Re-executed reads might return different results each time they are executed; however, when the clerk eventually receives the response to one of its retransmitted requests, it will use that response as if that was the only read that was ever executed, ignoring all others. This makes this optimization safe from the client perspective as well.
Proving correctness for this read optimization in Grove is challenging. vRSM specifies linearizability for reads by establishing an exact order in which operations are executed, requiring the proof to "linearize" an operation by adding it to this execution order at most once and at the instant that the operation logically executes. (Specifically, vRSM is specified in the style of logical atomicity [23].) Importantly, proofs must add reads to the execution order before later writes are acknowledged to clients.
The proof challenge lies in determining when to linearize a read. On one hand, at the moment that the read executes, the proof does not know if that execution's response will be received by the client, so it is unclear whether to linearize the read. On the other hand, if the proof waits until the client receives a response, it is too late to linearize the read because concurrent write operations might have been acknowledged to other clients and already added to the execution order.
Grove enables the vRSM proof to address this challenge by providing support for prophecy variables [1,27], which was initially added to Perennial to support vMVCC [10]. Specifically, when a clerk issues a read request RPC to a vRSM server, it uses a prophecy variable to speculate on what the eventual response will be. When the server executes a read, it checks whether the read result matches the prophecy variable speculation, and if so, whether this is the first execution of the read (using ghost state to track re-execution). If so, the server linearizes the read. When the client ultimately receives a response, it resolves the prophecy variable prediction against the actual response. In case the speculation was wrong, the proof stops with a contradiction. In case the speculation was right, the proof learns that the read was linearized with the expected value.

Reasoning about crashes
Grove reasons about crashes by extending Crash Hoare logic [8,11] to the distributed setting, and borrows the notion of a crash obligation to encode what can be assumed about the durable state of the system after a crash. To reason about the state of durable storage, Grove introduces a file points-to resource, written "filename" file → j data, which says that filename on node j contains data. File points-to resources are durable, so ownership of them usually appears in the crash obligation of a node. In contrast, heap resources are volatile and cannot be made part of a crash obligation, because the heap will be lost after a crash.
Example. The proof of vKV's replica servers maintains the per-node crash obligation ∃e, ∃ℓ, "log.dat" file → j ℓ * accepted j [e] list → ℓ. The crash obligation says that at all times, replica server j must own the log file and that the contents of the file match the list of operations in the accepted ghost state.
The accepted ghost state tracks all promises made to other nodes about operations that have been applied. The per-node crash obligation therefore captures the need to prove that operations are made durable before being acknowledged to other nodes.
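The discipline that this crash obligation enforces, making an operation durable before acknowledging it, can be sketched in Go with the standard os package. This is an illustrative model and not vKV's storage code: `DurableLog`, `Accept`, `Recover`, and `tempLogPath` are hypothetical, the one-operation-per-line file format is an assumption, and torn writes are not handled.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// DurableLog is a hypothetical on-disk log of accepted operations.
type DurableLog struct{ f *os.File }

// OpenDurableLog opens (or creates) the log file for appending.
func OpenDurableLog(path string) (*DurableLog, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &DurableLog{f: f}, nil
}

// Accept appends op and returns only after fsync succeeds, so the caller
// may acknowledge op to other nodes once Accept returns nil.
func (l *DurableLog) Accept(op string) error {
	if _, err := l.f.WriteString(op + "\n"); err != nil {
		return err
	}
	return l.f.Sync()
}

// Close closes the underlying file.
func (l *DurableLog) Close() error { return l.f.Close() }

// Recover rebuilds the list of accepted operations from the file, as a
// restarted node would after losing its heap.
func Recover(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	s := strings.TrimSuffix(string(data), "\n")
	if s == "" {
		return nil, nil
	}
	return strings.Split(s, "\n"), nil
}

// tempLogPath returns a fresh log path for the demo and a cleanup func.
func tempLogPath() (string, func(), error) {
	dir, err := os.MkdirTemp("", "grove-sketch")
	if err != nil {
		return "", nil, err
	}
	return filepath.Join(dir, "log.dat"), func() { os.RemoveAll(dir) }, nil
}

func main() {
	path, cleanup, err := tempLogPath()
	if err != nil {
		panic(err)
	}
	defer cleanup()
	l, err := OpenDurableLog(path)
	if err != nil {
		panic(err)
	}
	l.Accept("put a 1")
	l.Accept("put b 2")
	l.Close()
	ops, _ := Recover(path)
	fmt.Println(ops) // [put a 1 put b 2]
}
```

Replying before `Sync` returns would be the bug that the crash obligation rules out: the node could crash, recover a shorter log, and contradict a promise it had already made.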

How Grove rules out bugs
This section sketches the specification and proof for several components from §2. It also poses some tricky scenarios for the components and explains either (1) why the scenario would result in buggy behavior and where the proof would get stuck because of the bug, or (2) why the scenario is subtly safe and how the proof covers it. A common theme is that the proofs center around choosing the right kinds of resources and do not need to break into cases for the different scenarios.

Primary replica server
Specification. An important part of vRSM's primary/backup replication protocol is embodied in the primary server's Apply function, whose job is to add a new request to the log on the primary as well as all backup replicas. Figure 4 shows the simplified code for this function, and in the rest of this subsection, we will walk through the proof of its correctness.
The spec we aim to prove for Apply is that its postcondition is ∃ℓ, committed list ⊒ ℓ + [op]. This is a formal way of stating that, after Apply(op) is done, the client's op definitely shows up in the committed log somewhere. The op might appear in the log after a number of other operations, which may have come from other clients. Similarly, op might not be the latest operation in the log either, if other operations arrived after it; however, the use of ⊒ allows the postcondition to ignore subsequent parts of the log. The spec does allow the operation to be added multiple times; a stronger exactly-once spec can be obtained on top of this spec via the exactlyonce library.
Proof. The first line acquires the primary server lock, which serializes concurrent calls to Apply. In the proof, the postcondition of s.mutex.Lock() provides ownership of the primary's points-to resource, accepted 0 [e] list → ℓ. While holding the lock, the primary establishes the order in which this op will execute (namely, nextIndex), and applies op to its local state. At this point, the proof updates the thread's ownership of accepted 0 [e] list → ℓ to accepted 0 [e] list → ℓ + [op].
The proof of Apply collects the postconditions of the n calls to ApplyAsBackupRPC to get the resources accepted j [e] list ⊒ ℓ + [op] for every backup j. At this point, the proof has enough lower bound resources to be certain that the operation is committed. The proof opens the invariant I rep defined in §3.2.3 and temporarily gets ownership of committed list → ℓ. The proof then updates it to committed list → ℓ + [op]. Before the proof can complete, it must close the invariant I rep by returning ownership of the committed points-to. To close the invariant I rep, the proof gives up the resources committed list → ℓ + [op] and accepted j [e] list ⊒ ℓ + [op] for every server j, which match I rep's body with ℓ + [op] in place of ℓ.
Since the lower-bound resource for the committed list matches the desired postcondition, the proof is complete.
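The control flow that this proof follows can be sketched in Go. This is not the code of Figure 4: `Primary`, `BackupClient`, `memBackup`, and `errUnavailable` are hypothetical, the in-memory backup stands in for the RPC, and releasing the lock before contacting backups is a choice made for this sketch (as written it assumes calls to Apply do not overlap, since the toy backup rejects out-of-order indices).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// BackupClient is a hypothetical client stub for a backup replica.
type BackupClient interface {
	ApplyAsBackup(epoch, index uint64, op string) error
}

// Primary is a hypothetical primary replica.
type Primary struct {
	mu      sync.Mutex
	epoch   uint64
	log     []string
	backups []BackupClient
}

// Apply appends op locally, replicates it to every backup, and returns
// nil only once all backups have accepted it (i.e., op is committed).
func (p *Primary) Apply(op string) error {
	p.mu.Lock()
	index := uint64(len(p.log)) // position of op in the log
	p.log = append(p.log, op)
	epoch := p.epoch
	p.mu.Unlock()

	errs := make([]error, len(p.backups))
	var wg sync.WaitGroup
	for i, b := range p.backups {
		wg.Add(1)
		go func(i int, b BackupClient) {
			defer wg.Done()
			errs[i] = b.ApplyAsBackup(epoch, index, op)
		}(i, b)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err // some backup did not accept: not committed
		}
	}
	return nil
}

var errUnavailable = errors.New("backup unavailable")

// memBackup is an in-memory backup used only to exercise the sketch.
type memBackup struct {
	mu   sync.Mutex
	log  []string
	down bool
}

func (b *memBackup) ApplyAsBackup(epoch, index uint64, op string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.down || index > uint64(len(b.log)) {
		return errUnavailable
	}
	if index == uint64(len(b.log)) {
		b.log = append(b.log, op)
	}
	return nil
}

func main() {
	b1, b2 := &memBackup{}, &memBackup{}
	p := &Primary{backups: []BackupClient{b1, b2}}
	fmt.Println(p.Apply("put a 1")) // <nil>
	fmt.Println(b1.log, b2.log)     // [put a 1] [put a 1]
}
```

Each nil result from a backup plays the role of one lower-bound resource; Apply may report success only after it has collected one from every backup, which is the condition needed to close I rep.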

What if a backup concurrently applies more operations?
If backup j applies an operation concurrently, it will end up growing its accepted j [e] list (that is what happens, for instance, during ApplyAsBackupRPC). But, since the points-to is append-only, the lower bound resource accepted j [e] list ⊒ ℓ + [op] that the primary received from calling ApplyAsBackupRPC is still valid, which captures the fact that the operation op appears in backup j's log, even if it's not the latest. Since the operation is still in all the backups' logs, it is safe for the primary to reply OK.
Why can't a backup concurrently remove operations?
From the point of view of the primary server, a (buggy) backup j might remove the latest operation op, perhaps because it crashed and restarted with a truncated log recovered from its disk. In this case, it would be incorrect for the primary to reply OK to the client.
Fortunately, this possibility is ruled out in the proof by the resources that the primary owns. By the time the proof of Apply has obtained accepted j [e] list ⊒ ℓ + [op], the proof knows that backup j owned accepted j [e] list → ℓ + [op] at some point, and since the points-to is append-only, it would be impossible for the list to shrink. So, the primary does not need to consider this possibility, and can be sure that it is safe to reply OK. This reasoning even works across epochs, since one of these append-only lists will be used as the starting point for the next epoch.
Note that if the code for ApplyAsBackup really was buggy, and could lose operations on the backup, the developer would not be able to prove the correctness of ApplyAsBackup. The append-only nature of accepted j [e] and the backup's crash obligation mean that the backup's proof must ensure the log is properly preserved under crashes and recovery.

Reconfiguration
A separate challenging aspect of vRSM lies in reconfiguration. This subsection will walk through the proof of Reconfigure, as shown in Figure 6. This function is invoked on an operator's machine when the operator wants to change the set of servers, perhaps adding new machines to replace failed ones. The spec is the following:

{⊤}
This specification states that calling Reconfigure() requires all of the servers specified in newServers to be already running the vRSM replica software (although the servers might have been just installed, containing no key-value state), so that the reconfiguration logic knows it is safe to use them as a replica. Reconfigure() will contact both the old servers and the new servers, transferring the state to the new servers before registering them with the configuration service.
The postcondition in Reconfigure()'s specification appears to be weak, in the sense that it does not promise that Reconfigure() will make progress. This is because Grove is limited to safety properties (ensuring that vRSM never returns the wrong result) as opposed to liveness, such as guaranteeing that a client will receive a response. However, the ⊤ postcondition does actually guarantee an absence of all safety bugs during reconfiguration, such as corrupting the state sent to the new servers, losing some operations applied concurrently by the old servers, etc. This is because Reconfigure() does not own any interesting resources to start with, which precludes it from tampering with any resources held by the rest of vRSM. Any resources that Reconfigure() obtains must come from invariants, such as I rep. However, Grove requires that the proof correctly re-establishes any invariant that is opened, thereby ensuring that the invariants are maintained throughout the execution of Reconfigure().
Proof. The overall structure of Reconfigure() is to get a new epoch, then choose one of the old servers to seal and get a copy of the old state from, then send this state to all of the new servers, register the configuration, and activate the new primary. The proof relies on the fact that Grove's append-only list resource, used to represent vRSM's log, is indexed by epoch.
A key aspect of the proof lies in the resource returned to Reconfigure() by the GetStateAndSeal RPC. The postcondition of GetStateAndSeal(newEpoch) is: ∃ℓ, (oldState corresponds to log of operations ℓ) * accepted j [e] list → □ ℓ. Once the proof receives ownership of the read-only resource accepted j [e] list → □ ℓ for the old epoch, it can conclude that the old configuration will not apply any more operations. This is because replica j must have given up ownership of its list → ℓ to produce the read-only resource, which precludes it from extending ℓ with more operations. After reconfiguration completes, a new append-only list resource accepted j [e + 1] will be allocated (assuming the new epoch is e + 1), which allows vRSM to append new operations.
As in the primary/backup replication proof, Grove allows the developer to prove the correctness of Reconfigure() with modular reasoning, without having to consider explicit interleavings with other parts of the system. Nonetheless, the proof does rule out bugs due to subtle interactions, as follows.
Why can't the old primary commit additional operations?
The developer might worry that after Reconfigure() fetches the old state, the old primary could execute additional operations. This would mean that Reconfigure() will send an incomplete state to the new servers, losing an operation. This possibility is ruled out in vRSM's proof due to the read-only resource sent back by the GetStateAndSeal RPC. Having this read-only resource implies that the old primary cannot add any more operations to committed list → ℓ, since that would require adding the operation to every old replica, which would in turn contradict the read-only resource.
What if the log contains operations that were never committed?
It is possible that the log obtained by Reconfigure() contains some operations that were never committed in the old configuration, for example if the primary sent the operation to some but not all backups before failing. However, it is nonetheless safe to commit those operations in the new config. This is because the primary first adds an operation to its own log before sending it to the backups. vRSM captures this in an invariant by stating that every operation in a backup's accepted j [e] must also appear at the same position in the primary's accepted 0 [e], making it safe to commit. A client clerk will learn the outcome of its operation when it resends its request to the servers in the latest configuration.
What if a replica is sealed again?
If some machine initiates reconfiguration, it might seal one of the existing replicas but then crash. A second reconfiguration may then try to seal the replica for a second time. In vRSM, a sealed replica can be sealed any number of times in the same epoch, until a new epoch starts. In the proof, this is reflected by the fact that accepted j [e] list → □ ℓ is duplicable and can be sent in response to repeated seal requests.

Lease-based reads
The key challenge for lease-based reads lies in ensuring that the result returned by ApplyReadonly(), as shown in Figure 7, reflects a properly linearized execution: the result cannot be stale (i.e., missing writes that have already finished), and cannot be rolled back (i.e., reflect uncommitted writes that might be lost due to a crash or reconfiguration).
vRSM proves that the value is not stale by using the lease. The read operation will be executed against the state of the local server, represented by accepted j [e]. At the instant of the GetTimeRange invocation on line 4, the proof opens the lease invariant to obtain CurrentEpoch → e, and opens the replication invariant I rep to obtain the fact that all of the operations in committed are also in accepted j [e] (which has not been superseded by any higher epoch e′).
To prove the second part (that the returned value cannot observe writes that will be rolled back), vRSM waits until all of the writes preceding the read's linearization point to the same key have been committed, in waitForCommitted. There are two cases to consider. First, there may be no pending writes, e.g., if at the instant of the GetTimeRange invocation, idx is already committed. In this case, the proof linearizes the read operation immediately at the instant of GetTimeRange. The second possibility is that there are some pending writes to commit. In this case, the proof maintains a set of ghost state updates (based on the helping pattern [5,7]) that must be logically applied when the preceding write is committed (which happens in the primary server's proof of Apply()). Note that this case distinction is only possible in the proof; the code does not know which case it's in when running GetTimeRange, so waitForCommitted waits for idx to be committed in both cases (and, in the first case, returns right away). If reconfiguration happens in the meantime, and the epoch number changed before idx could be committed, the read result might not be valid, and the server tells the client to retry.
What if the server pauses for a long time after checking the lease?A typical concern with leases is the freshness of the lease check.What if the server running ApplyReadonly() does the lease check on line 4 of Figure 7, but then pauses for a long time (e.g., due to garbage collection) before actually executing the read operation on line 6?By that time, reconfiguration may have taken place, choosing a new primary, and that primary has executed more writes.
Such delays cannot violate correctness.Even if a new primary executes writes, the read will be linearized before those writes.This is allowed because the read request arrived before reconfiguration (since the lease check passed after the request was sent).The lock held by ApplyReadonly() ensures that the server's state does not change between the lease check and the execution of the read operation.
In the proof, the read operation is linearized at the instant of the GetTimeRange() invocation in Figure 7, if the returned latest time is less than the lease expiration time. This allows the proof to open the lease invariant to establish that the epoch is still e, and thus the in-memory state of that replica j that will be accessed by LocalRead, corresponding to accepted_j[e], will be committed if waitForCommitted() succeeds.
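The control flow described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code of Figure 7: the Replica type, its fields, and the Accept, Commit, and EnterEpoch helpers are hypothetical, and the time source is an injected function standing in for GetTimeRange.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrRetry asks the client to retry: the lease check failed or the epoch changed.
var ErrRetry = errors.New("retry")

// Replica is a hypothetical stand-in for one vRSM replica.
type Replica struct {
	mu          sync.Mutex
	cond        *sync.Cond
	epoch       uint64
	leaseExpiry uint64            // lease is valid while the latest time < leaseExpiry
	vals        map[string]string // local state, may include uncommitted writes
	lastWrite   map[string]uint64 // log index of the latest write to each key
	nextIndex   uint64            // number of operations accepted so far
	committed   uint64            // number of operations known to be committed
	timeRange   func() (earliest, latest uint64) // stand-in for GetTimeRange
}

func NewReplica(leaseExpiry uint64, timeRange func() (uint64, uint64)) *Replica {
	r := &Replica{leaseExpiry: leaseExpiry, vals: map[string]string{},
		lastWrite: map[string]uint64{}, timeRange: timeRange}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// Accept applies a write locally without committing it yet.
func (r *Replica) Accept(key, val string) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextIndex++
	r.vals[key] = val
	r.lastWrite[key] = r.nextIndex
	return r.nextIndex
}

// Commit records that the first n operations are committed.
func (r *Replica) Commit(n uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n > r.committed {
		r.committed = n
	}
	r.cond.Broadcast()
}

// EnterEpoch models reconfiguration moving this replica to a new epoch.
func (r *Replica) EnterEpoch(e uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.epoch = e
	r.cond.Broadcast()
}

// ApplyReadonly checks the lease, reads locally under the same lock, and then
// waits for the write that the read depends on to be committed.
func (r *Replica) ApplyReadonly(key string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Conservative lease check: compare the latest possible current time.
	if _, latest := r.timeRange(); latest >= r.leaseExpiry {
		return "", ErrRetry
	}
	// The lock is still held, so the state cannot change after the lease check.
	epoch, val, idx := r.epoch, r.vals[key], r.lastWrite[key]
	// waitForCommitted: returns right away if idx is already committed.
	for r.committed < idx && r.epoch == epoch {
		r.cond.Wait()
	}
	if r.epoch != epoch {
		return "", ErrRetry // the pending write may never commit in this epoch
	}
	return val, nil
}

func main() {
	r := NewReplica(100, func() (uint64, uint64) { return 10, 12 })
	idx := r.Accept("k", "v1")
	done := make(chan string, 1)
	go func() {
		v, _ := r.ApplyReadonly("k")
		done <- v
	}()
	r.Commit(idx) // unblocks the reader if it is waiting
	fmt.Println(<-done)
}
```

Note that the value is read before waiting, so a long pause after the lease check only delays the reply; it does not change which state the read observes.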

Client cache consistency with leases
The top-level spec for the key-value caching library cachekv is that Get and Put behave like a linearizable key-value service, with GetAndCache working functionally like a Get.
To prove the linearizability of its lease-based caching, the library uses a ghost map resource, which has a k →kv v assertion representing the fact that the current value of k is v. The library maintains two invariants: one global invariant across all instances of the key-value caching library that share the same state, and one local lock invariant for each node's own cache, protected by a mutex lock on that node. The global invariant maintains, for each key, a time-bounded invariant k →kv v L and ownership of the expiration time L expires → exp, corresponding to the expiration time encoded in the underlying key-value pair. The expiration time may be in the past if the most recent lease on k has already expired. On each node, the library's lock invariant maintains k →kv v L and a lower bound on the lease expiration time, L expires ≥ exp, corresponding to the expiration time stored in its local cache.
When a client executes Get, the library acquires its per-node lock and checks the lease expiration time for the requested key. If the key is cached and the lease is not expired, the library opens the time-bounded invariant to prove that the cached value is the current value for that key. Put waits until the lease expires, at which point its proof can reclaim the k →kv v resource from the lease, update it to the new value v′, and then put it back into a new lease with the same (expired) expiration time, to re-establish the global invariant that every key corresponds to some lease. The proof of GetAndCache extends the lease in the global invariant, and makes a copy of the lease and the corresponding expiration time lower bound in its local node invariant.
Why can't a Put change a currently cached value? The proof of Put has to update k →kv v to maintain the invariant. To do this update, the proof needs ownership of the points-to. If a client currently has that key cached (and not expired), then the time-bounded invariant containing that key's points-to has not expired yet, and the Put proof cannot reclaim it.
Why can't GetAndCache decrease the lease expiration time instead of extending it? This would result in a bug because a Put might change the value while another client still believes it has an up-to-date cache. The proof would get stuck when re-establishing the global invariant with the decreased lease expiration time, because that would require updating ownership of the expiration time L expires → exp to a smaller number exp′ < exp, which is not possible: the lease extend rule from §3.4 only allows the expiration time to increase.
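The following Go sketch illustrates the caching discipline described above. It is an assumption-laden simplification rather than the cachekv implementation: Entry, Store, FakeClock, and CacheKv are hypothetical names, the store is an in-memory map with a conditional put, and a single shared fake clock replaces loosely synchronized clocks (so the demo is single-threaded with respect to time).

```go
package main

import (
	"fmt"
	"sync"
)

// Entry is the value stored per key: the data plus the lease expiration time.
type Entry struct {
	Val    string
	Expiry uint64 // a lease on this key is valid while now < Expiry
}

// Store is a hypothetical stand-in for the underlying key-value service.
type Store struct {
	mu sync.Mutex
	m  map[string]Entry
}

func (s *Store) Get(k string) Entry {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[k]
}

// CondPut writes next only if the current entry still equals expect.
func (s *Store) CondPut(k string, expect, next Entry) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m[k] != expect {
		return false
	}
	s.m[k] = next
	return true
}

// FakeClock is a manually advanced clock for the demo (not thread-safe).
type FakeClock struct{ t uint64 }

func (c *FakeClock) Now() uint64 { return c.t }
func (c *FakeClock) SleepUntil(t uint64) {
	if t > c.t {
		c.t = t
	}
}

// CacheKv is one node's caching client.
type CacheKv struct {
	mu    sync.Mutex // protects cache (the per-node lock)
	store *Store
	clock *FakeClock
	cache map[string]Entry
}

func NewCacheKv(s *Store, c *FakeClock) *CacheKv {
	return &CacheKv{store: s, clock: c, cache: map[string]Entry{}}
}

// Get serves from the cache only while the cached lease is unexpired.
func (c *CacheKv) Get(k string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.cache[k]; ok && c.clock.Now() < e.Expiry {
		return e.Val
	}
	delete(c.cache, k)
	return c.store.Get(k).Val
}

// GetAndCache extends the lease (never shortens it) and caches the value.
func (c *CacheKv) GetAndCache(k string, dur uint64) string {
	for {
		old := c.store.Get(k)
		next := old
		if e := c.clock.Now() + dur; e > next.Expiry {
			next.Expiry = e
		}
		if c.store.CondPut(k, old, next) {
			c.mu.Lock()
			c.cache[k] = next
			c.mu.Unlock()
			return next.Val
		}
	}
}

// Put waits for the lease to expire, then keeps the (expired) expiration time.
func (c *CacheKv) Put(k, v string) {
	for {
		old := c.store.Get(k)
		c.clock.SleepUntil(old.Expiry)
		if c.store.CondPut(k, old, Entry{Val: v, Expiry: old.Expiry}) {
			return
		}
	}
}

func main() {
	store := &Store{m: map[string]Entry{"x": {Val: "1"}}}
	clock := &FakeClock{}
	a, b := NewCacheKv(store, clock), NewCacheKv(store, clock)
	fmt.Println(a.GetAndCache("x", 10)) // a holds a lease on "x" until time 10
	b.Put("x", "2")                     // cannot take effect before time 10
	fmt.Println(a.Get("x"), clock.Now())
}
```

If GetAndCache were allowed to shorten Expiry, the conditional put in Put could succeed while another node's cached lease was still unexpired, which is the bug the proof rules out.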

Versioning read-only operations E
When the developer supplies a VersionedStateMachine struct to the vRSM library, the developer must prove that their methods satisfy the spec expected for this interface. To do this, the developer can choose what resource they want to use to represent ownership of the in-memory state on a given node, which we will denote by VSM → ops. Here, ops refers to the list of operations that have been executed so far, and VSM → ops says that the local state corresponds to exactly those operations applied in that order, even if the implementation does not keep all of them in memory. The developer also specifies a logical function, ComputeReply(ops, newop), which determines the expected reply for a new operation newop executed after a list of preceding operations ops.
A key optimization for achieving fast reads is to avoid waiting for all of the writes to be committed before responding to a read. The Read() method in Figure 10 specifies the write operation that the read depends on, by returning the uint64 index of this write dependency as the first return value. The specification for Read(), which the developer must prove, requires that this index be correct: where StableReply(ops, idx, op, res) is defined as:

∀ops', ops' ⊑ ops ∧ len(ops') ≥ idx → ComputeReply(ops', op) = res

This specification says that Read() does not modify the logical state (the postcondition returns the same VSM → ops that it got in the precondition) and captures the necessary conditions for the vRSM library to safely execute the read after idx is committed.
Why can't the application incorrectly track read dependencies? If Read() fails to correctly track read-write dependencies, it might return an idx value that's too small, that is, missing a relevant recent write that affects the read result. The Read specification ensures this cannot happen. StableReply requires that the read op returns the same result res regardless of where in the log of operations it is executed, from idx up to the entire current history ops. If Read() had a bug, the developer would not be able to prove this specification for their implementation.
Why can't a read-only operation have side effects? The implementation of Read() is arbitrary code, and can modify state. It is safe for Read() to make internal changes, which are not visible at the interface (e.g., modifying some internal caches). However, if Read() makes changes to the state that are visible at the interface level (e.g., inadvertently changing the value of some key in vKV), it would violate vRSM's guarantees, such as linearizability and replication. By specifying VSM → ops, the developer implicitly decides which in-memory state is visible: state not reflected in ops is not visible to ComputeReply and hence cannot affect vRSM's behavior. Since the postcondition of Read() requires VSM → ops, the developer would be unable to prove this postcondition if Read() made any visible changes.
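To make the StableReply condition concrete, the sketch below checks it by brute force for a toy key-value state machine. Everything here is illustrative and assumed: the Op and KVState types and the executable StableReply checker are hypothetical, whereas in Grove the condition is a logical statement proved once rather than tested at run time.

```go
package main

import "fmt"

// Op is a toy operation: a write of Val to Key, or a read of Key.
type Op struct {
	Key, Val string
	Write    bool
}

// ComputeReply gives the expected reply for newop executed after ops:
// the value of the latest write to newop.Key, or "" if there is none.
func ComputeReply(ops []Op, newop Op) string {
	val := ""
	for _, o := range ops {
		if o.Write && o.Key == newop.Key {
			val = o.Val
		}
	}
	return val
}

// KVState keeps only a summary of the operation list, plus per-key versions.
type KVState struct {
	vals      map[string]string
	lastWrite map[string]uint64 // 1-based log index of the latest write per key
	nops      uint64
}

func NewKVState() *KVState {
	return &KVState{vals: map[string]string{}, lastWrite: map[string]uint64{}}
}

// Apply executes one operation from the log.
func (s *KVState) Apply(op Op) {
	s.nops++
	if op.Write {
		s.vals[op.Key] = op.Val
		s.lastWrite[op.Key] = s.nops
	}
}

// Read returns the index of the write the read depends on, and the reply.
// It does not change any state that ComputeReply can observe.
func (s *KVState) Read(op Op) (uint64, string) {
	return s.lastWrite[op.Key], s.vals[op.Key]
}

// StableReply checks that every prefix of ops with length >= idx yields res.
func StableReply(ops []Op, idx uint64, op Op, res string) bool {
	for n := idx; n <= uint64(len(ops)); n++ {
		if ComputeReply(ops[:n], op) != res {
			return false
		}
	}
	return true
}

func main() {
	ops := []Op{{"a", "1", true}, {"b", "2", true}, {"a", "3", true}, {"b", "4", true}}
	s := NewKVState()
	for _, o := range ops {
		s.Apply(o)
	}
	read := Op{Key: "a"}
	idx, res := s.Read(read)
	fmt.Println(idx, res, StableReply(ops, idx, read, res))
	// An index that is too small misses the write at position 3.
	fmt.Println(StableReply(ops, 2, read, res))
}
```

The second check fails because the prefix of length 2 would reply with the older value of "a", which is exactly the kind of undersized dependency index that the Read specification excludes.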

Application state consistency in paxos E
The proof of the paxos library, which is used to replicate configservice, is structured much like the proof of the primary-backup library. One difference is that paxos uses majority quorums instead of replicating to every server, which means that a majority quorum q of the accepted_i resources agree with the committed resource, rather than all of them: ∃ℓ, ∃e, ∃q,

⊒ ℓ
Intersection of quorums allows the proof to conclude that the new leader has all of the previously committed operations after checking with a majority of the n servers, allowing it to gain ownership of the leader resource for the new epoch.
The big difference from primary-backup replication is that paxos provides weak guarantees when reading the state. Figure 15 shows the precise specification for Begin, which returns two values: the oldstate value and the commit callback function. The WF predicate (chosen by the caller of the paxos library, such as configservice) captures the notion of a well-formed state. For example, in the case of configservice, WF defines a well-formed encoding of the configservice state consisting of an epoch, a configuration, lease expiration, etc. The paxos library captures the fact that oldstate might be stale by requiring that WF is duplicable: that is, WF can only capture properties that are always going to be true about a state, and cannot capture non-duplicable facts such as freshness. The rest of the commitspec specification is written using angle brackets, which in Grove (like in Iris) indicates a logically atomic specification [27]. This means that the transition between the angle-bracketed pre- and postcondition occurs atomically at the linearization point inside this function, which, crucially, lets multiple threads or nodes call this operation concurrently on the same exclusive resources.
In this case, that exclusive resource is Paxos → state', which represents ownership of the fact that state' is the latest committed state.This resource is held by the application's invariant; for instance, configservice maintains an invariant owning this resource and specifying that there is also a lease valid until the expiration time encoded in that state.
The commit specification now says that, at some point during commit's execution (specifically, at the linearization point), it will need ownership of the Paxos → state' resource, for some state'. (The special ∀∀ quantifier indicates that this variable is quantified across both the pre- and postcondition, and that its value only gets determined at the linearization point. The value is chosen by the client that invokes commit.) This state' might not be the same as oldstate, which captures the possibility that other transactions may have modified the state by the time commit is executed. The logical atomicity postcondition says that, either commit will return an error and the state remains state', or commit will return success, in which case the caller learns that state' was in fact the same as oldstate and the committed state is changed to newstate.
What if configservice gets stale or invalid state from paxos? When configservice's implementation of ReserveEpochAndGetConfig, shown in Figure 9, gets the starting state oldstate, Begin's postcondition promises that it is well-formed according to WF, which allows the proof to conclude that it's safe to unmarshal it. However, because WF is duplicable, the proof would not be able to (incorrectly) conclude anything about the freshness of oldstate, such as whether the corresponding lease is currently valid. When invoking commit, if commit is about to return success, then the logically atomic specification allows the proof to conclude that the state was indeed fresh. However, the precondition of commit requires the caller (the proof of ReserveEpochAndGetConfig) to establish WF(newstate) before knowing whether oldstate was fresh or not, which prevents the proof from embedding any information about whether oldstate was ever committed into WF(newstate).
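The Begin/commit pattern can be illustrated with a single-process Go sketch. This is not the paxos library: the Paxos type below is a hypothetical in-memory mock with no replication, the signatures are assumptions, and ReserveEpoch is a simplified stand-in for ReserveEpochAndGetConfig in which the state is just a decimal epoch number and "well-formed" means "parses as an unsigned integer".

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"sync"
)

// ErrStale is returned by commit when the state changed after Begin.
var ErrStale = errors.New("state changed since Begin")

// Paxos is an in-memory mock of the replicated state (no real replication).
type Paxos struct {
	mu      sync.Mutex
	state   string
	version uint64 // incremented on every successful commit
}

// Begin returns a possibly stale snapshot and a commit callback.
// The caller learns nothing about freshness until commit succeeds.
func (p *Paxos) Begin() (string, func(newstate string) error) {
	p.mu.Lock()
	old, seen := p.state, p.version
	p.mu.Unlock()
	commit := func(newstate string) error {
		p.mu.Lock()
		defer p.mu.Unlock()
		// Linearization point: succeed only if nobody committed since Begin.
		if p.version != seen {
			return ErrStale
		}
		p.state = newstate
		p.version++
		return nil
	}
	return old, commit
}

// ReserveEpoch increments the epoch stored in the state, retrying on ErrStale.
func ReserveEpoch(p *Paxos) uint64 {
	for {
		old, commit := p.Begin()
		// Decoding is safe because every committed state is well-formed.
		epoch, err := strconv.ParseUint(old, 10, 64)
		if err != nil {
			panic("ill-formed state: well-formedness violated")
		}
		// newstate is built before knowing whether old was fresh.
		if commit(strconv.FormatUint(epoch+1, 10)) == nil {
			return epoch + 1 // success implies old was the latest state
		}
	}
}

func main() {
	p := &Paxos{state: "0"}
	_, c1 := p.Begin()
	_, c2 := p.Begin()
	fmt.Println(c1("1")) // first committer wins
	fmt.Println(c2("7")) // second sees ErrStale and must retry from Begin
	fmt.Println(ReserveEpoch(p))
}
```

The mock mirrors the shape of the specification: the snapshot from Begin is only usable for properties that hold of every committed state, and freshness is learned solely from a successful commit.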

Bank transactions E
The bank uses one instance of vKV to store account balances and one instance of a lock service to maintain locks on individual accounts. The specification for the lock service allows maintaining a lock invariant for each lock. For each account maintained by the bank, the bank puts the account's balance resource, acct →kv bal, in the lock invariant for lock acct maintained by the lock service. When the bank needs to access a specific account (either for Transfer or Audit), it acquires the account's lock, which gives it that account's points-to resource, allowing it to access the balance in vKV.
Why can't the lock service lose its state after a crash? If the lock service lost track of held locks, and allowed acquiring a lock that was already held before the lock service crashed, the proof of the lock service would get stuck. Specifically, the lock service would need to give the caller access to the lock's invariant (the balance points-to in the case of the bank example), but the lock service already handed out that resource before the crash, and separation logic does not allow duplicating resources. Thus, the proof must ensure that at most one copy of the lock resource is handed out at any given time.
Why can't the bank fail to follow the locking rules? If the bank fails to acquire the appropriate locks, it might have race conditions updating the same accounts from different threads. However, separation logic rules prevent this: accessing a key-value pair in vKV requires ownership of the corresponding points-to resource for that key. The bank has exactly one points-to resource for every key, and that resource is stored in the lock service. Thus, if the bank does not acquire the lock for an account, it will not have the resource needed to allow its proof to access the balance.

Developing Grove proofs
A major benefit of Grove's use of concurrent separation logic is that it allows proving each function (and even each line of code) in isolation, without explicitly considering the interleavings due to concurrency, crashes, recovery, etc. This not only allows incrementally proving a single library, but also enables combining the proofs of multiple components or functions into a single proof for the composed system.
Composability of proofs is a powerful property that is not true in the general case. For example, if a system is running both a key-value service and a lock service, the key-value library might inadvertently send a message to the lock service that causes it to release a lock, thereby invalidating its proofs. As another example, if the developer adds a new RPC method to a key-value server, and this RPC incorrectly updates data structures used by other RPCs, adding this RPC would invalidate the proofs of existing RPC methods. Concurrent separation logic enforces ownership rules to ensure that all verified code is "well-behaved" in a way that avoids problems like the above, and allows for sound composition.
The rest of this section presents several case studies that illustrate the benefits of concurrent separation logic in Grove.

Proving top-level spec for vKV
The top-level theorems for vKV are specifications for the top-level functions:
1. The replica server main() function is crash-idempotent [7,43]. This covers the execution of any code invoked by main(), including any RPC handlers that main() sets up.
2. Reconfigure(newServers) is always safe to run, if newServers are all valid replica servers.
3. MakeClerk(configAddrs) correctly initializes a clerk, if configAddrs are the addresses of the configservice.
Proving these theorems involves proving specs for functions in vKV one-by-one, using specs for lower-level components to verify higher-level code, culminating in a proof of the top-level functions. A client application that uses vKV (e.g., the bank) can then be verified by applying these theorems to reason about Put and Get calls.

Evolving vKV to add leases
We originally built and verified the vRSM library and vKV without leases. This original version executed Get operations as a read-write operation; that is, by replicating the Get operation to the primary and all backups (including waiting for the replicas before replying to the client).
We later decided to improve performance of read-only operations by adding leases. This involved several changes: (1) adding GetLease to the configuration service and making sure other RPCs wait for any outstanding leases to expire before advancing the epoch number; (2) adding ApplyReadonly to the vRSM library as well as a helper thread on each replica that extends its lease with the GetLease RPC; (3) propagating the number of committed operations from the primary to the backups; (4) introducing VersionedStateMachine as the vRSM library's interface to allow some reads to happen without waiting for ongoing writes to finish; and (5) bypassing the exactly-once operations library for read-only operations.
In the proof before adding leases, the configuration service always owned the epoch number and could always advance it. With leases, ownership may reside in a time-bounded invariant, so the proof now must establish ownership by using the fact that the code checks for lease expiration, which allows the proof to use the time-bounded invariant expiration rule from §3.4. To reason about linearizability of lease-based reads from replica servers, the proof of the primary/backup replication and reconfiguration protocol remained the same, but we added a new proof on top that shows that ownership of the current epoch means any replica's state is at least as up-to-date as the committed operations.

Evolving configservice to use paxos E
We originally built and verified vRSM with an in-memory and unreplicated configservice. This original version was not fault-tolerant, and the previous top-level theorems of vKV assumed that the sole configservice server never crashes and restarts. To improve the fault tolerance of the overall vRSM system, we implemented the paxos library and modified the configservice to use paxos to manage state.
The proof of paxos borrows heavily from the proof of primary/backup replication and reconfiguration. We started by making a copy of the proof of the primary/backup replication protocol, then modified key invariants to make them quorum-based (such as the replication invariant described in §4.6), and finally reproved many of the same lemmas against these modified invariants.
The old configservice used a lock to coordinate concurrent RPCs accessing the configuration state. The new paxos-based configservice replaces those Lock and Unlock calls with calls to paxos.Begin and commit respectively, with additional error handling since commit can fail. Across this change, configservice's clerk API and specification (Figure 5) remained unchanged, except for clerks now taking multiple network addresses as input for the multiple configservice servers. Since the proofs of other components that use configservice depend only on the specification of configservice, and not on how that specification is proven, those proofs also still worked without major changes. For instance, the new proof of replica differs from the old proof by 18 lines added and 4 lines changed, which were largely needed because replica servers now take as input a list of multiple network addresses used to contact the configservice instead of a single address.

Line-by-line reasoning E
A developer in Grove can verify each component's implementation line-by-line. Instead of having to explicitly worry about interleavings and interactions (like the mover reasoning in IronFleet [21], CSpec [6], and Armada [35]), Grove's resources (either owned by specific threads or by specific invariants) indirectly constrain how different components can interact with one another.
For example, the proofs of different functions shown in §4.1, §4.2, and §4.3 form a proof about the combined system, without any explicit consideration of how these functions (Apply, Reconfigure, and ApplyReadonly) will interleave. The proof of Apply (§4.1) uses the I_rep invariant, and the proof of ApplyReadonly also uses the same invariant, yet the two proofs do not explicitly reason about each other (and moreover, the proof of Apply does not even know about the existence of the time-bounded lease invariant). Although the proofs are independent, they nonetheless cover all of the possible interactions in the resulting system.

Separate proofs of components E
The bank exemplifies how Grove enables verifying crash-safe, high-performance, distributed applications out of individual verified components. The bank code and proof refer only to the simple interfaces and specifications provided by the vKV and lock service clerks. Despite that, the bank's specification provides a strong guarantee in the face of crashes, reconfigurations, concurrency, network partitions, loosely-coupled clocks, etc. Grove's modular specs and proofs also allow concurrent development and replacement of components: the bank application developer can write and prove their code in parallel with the vKV developers, once they agree on the specification for the vKV interface, and similarly, vKV's implementation can be replaced with a different key-value store (e.g., using Raft [40]) as long as it satisfies the same API spec.

Implementation
Grove is implemented by extending Perennial [7], which is based on Iris [25,26,31] and Coq [45]. Grove inherits reasoning principles for concurrent Go code from Perennial, inherits general support for interactive separation logic reasoning and ghost resources from Iris, and adds support for distributed systems with new reasoning principles for the network, clock, and independent node crashes. Grove's extensions to Perennial involved 1,597 lines of Coq proof for new reasoning principles, along with other hard-to-quantify minor changes throughout Perennial. Grove comes with a distributed composition soundness theorem, which proves correctness of Grove's reasoning principles by showing that they imply a simple statement about the behavior of the distributed system under Grove's execution model.
Figure 17 shows the breakdown of code and proof for the different components. The top-level specification of the bank, which builds on most of the other components, is 52 lines. The specification for vKV, consisting of the four parts described in §5.1, is 49 lines. We confirm that the proof is complete using Print Assumptions in Coq. Across the different components, verification required 12× the lines of proof as lines of code, which is comparable to other concurrent and distributed systems verification projects: IronFleet's overhead is slightly lower [21] (and IronFleet also includes a proof of liveness, though it does not handle thread concurrency, leases, crashes, or reconfiguration), but GoJournal's is slightly higher [8].
One conclusion is that verifying a complete distributed system, such as vKV, which handles node-local concurrency, crash recovery, leases, and reconfiguration, did not come at the cost of an inflated proof overhead.

Evaluation
To demonstrate that Grove is capable of verifying realistic high-performance distributed systems, this section experimentally demonstrates that the vKV prototype, which we verified using Grove, is able to achieve high performance.
We also demonstrate that leases are particularly important for achieving high performance for reads in vKV.
Experimental setup. To evaluate vKV's performance, we use 8 CloudLab servers, with up to 3 for replicas, 4 for clients, and 1 for the configuration service. Each machine has an Intel Xeon CPU E5-2630v3 2.4GHz processor with 8 cores, 64GB of RAM, an Intel 200GB 6Gb/s SSD (SSDSC2BX200G4R) for storage, and an I350 Gigabit network card. We generate requests using YCSB [12] with uniformly random keys and 128-byte values. Clients run in a closed loop, issuing a new request as soon as the previous request completes; for each data point, we warm up the system for 20 seconds and then measure the performance for 1 minute. To measure throughput, we keep increasing the number of clients until the total throughput of all of the clients stops growing.
Baseline performance. To demonstrate that vKV achieves good performance, we compare with Redis. Redis is a widely-used high-performance key-value server, written in C. Redis targets somewhat different goals than vKV (it is designed to run on a single core, it does not support synchronous replication or live reconfiguration, etc.), but it nonetheless provides a reference point in terms of absolute performance for a key-value store. To make Redis comparable to vKV in terms of its guarantees, we run Redis with the appendfsync always option to ensure it makes changes durable before replying, and we run vKV on a single core (disabling all other cores in Linux) and with no backup replicas. Note that Redis does not implement exactly-once semantics for its operations (if a write gets retransmitted, it may end up being executed twice), whereas vKV stores a 16-byte request ID for each operation.
Figure 18 compares the performance of vKV with that of Redis. We report the mean of 10 runs; Redis's standard deviation is 1-2%, and vKV's is 7-11%, due to the high variance of the Go runtime when running on a single core. When running on multiple cores, vKV achieves higher throughput (e.g., 5.1× on 8 cores for YCSB 5% writes), with minimal performance variability. The results show that vKV's throughput is 67-73% of Redis's, and its request latency is comparable.
Reconfiguration. To demonstrate that vKV can recover from server failures by reconfiguring the system to add new servers, all while continuing to correctly handle client requests, we run a two-server configuration of vKV. At 10 seconds into the experiment, the primary server is killed, and reconfiguration starts (changing to a new primary and a new backup server). We use a variation of the YCSB workload, with 100 clients always issuing writes, and 100 clients always issuing reads (rather than each client issuing a mix). This is because, during reconfiguration, writes block if one of the servers is sealed (which would ultimately cause all clients to block if they were issuing reads and writes), but reads can proceed (so clients that never issue writes can proceed). Figure 19 shows the observed throughput by the read and write clients over time during this experiment. The results show that vKV can continue serving reads while reconfiguring. When the primary is initially killed, read throughput dips while clients with outstanding read requests sent to the primary wait to discover their connection is closed and while the remaining backup server marshals its key-value state to be sent to the new servers. After the backup is done marshaling its state, and after the clients connect to the backup and retransmit their requests, reads recover some of the throughput. Reads do not recover to their original throughput because of stuck reads: for a client that tries to read one of the keys whose write was in flight when the primary was killed, waitForCommitted returns only after reconfiguration, because the old primary did not commit those writes before being killed (and the backup doesn't know yet that those writes will not be committed by the primary). As more read clients get stuck, read throughput starts declining again. After the state is transferred to new servers (copying 1M key-value pairs, each 128 bytes long), the system switches to the new configuration and resumes executing reads and writes (including all previously-stuck operations). Most of the reconfiguration time is spent marshaling the state and sending it to new servers via the reconfiguration process (∼4 seconds in total).
Read performance with leases. Figure 20 shows vKV's throughput for different workloads as more replicas are added. For write-heavy workloads (50% or 100% writes), adding replicas reduces performance because writes encounter more overhead at the primary server, and there are not enough reads handled by other replicas to offset the costs. For read-heavy workloads, adding replicas improves performance: e.g., for YCSB 5% and 0% writes, 3 servers achieve 1.7× and 2.3× the throughput of a single server, respectively.

Related work
Grove is the first to support verifying distributed systems with thread- and node-level concurrency, crash recovery with durable state, time-based leases, and reconfiguration. Verifying all of these aspects in a single framework is critical because subtle bugs can occur due to interactions between these features. vKV, a realistic replicated key-value store, demonstrates the benefits of Grove's modular reasoning by proving the correctness of its primary/backup replication, durable storage, reconfiguration, concurrency, and leases. vKV's design is not novel, but rather a case study of what it takes to build a fault-tolerant primary-backup replication system. In doing so, it captures key challenges in state-of-the-art (unverified) distributed systems with primary/backup replication and a configuration service for reconfiguration, such as Chain Replication [46], FaRM [14], Boxwood [37], Bigtable [9], Megastore [3], FoundationDB [48], Kafka [28], and Tuba [2].
Concurrent separation logic for distributed systems. Broadly similar to our work, Disel [42] and Aneris [17,32] also use concurrent separation logic in the context of distributed systems. However, neither Disel nor Aneris provides support for reasoning about time-based leases or recovery from crashes, and they have not been used to verify a system with reconfiguration. These restrictions limit the distributed systems they can reason about. For example, Gondelman et al. [18] use Aneris to verify an eventually-consistent primary-backup key-value store. However, that system does not support reconfiguration, so if the primary fails, the system cannot process any further writes. Furthermore, writes are only lazily copied to replicas for availability, and thus reads from replicas may return stale values. vKV uses a combination of reconfiguration and leases for availability when a primary fails, while also guaranteeing that reads from replicas are up-to-date.
State-machine refinement. An alternative approach to verifying distributed systems is to prove refinements from a high-level protocol description down to executable code, as in IronFleet [21], Verdi [47], and IronSync [20]. However, these systems do not reason about time-based leases, reconfiguration, or node recovery. State machines also make it challenging to compose larger systems out of smaller components, which features extensively in our case study. IronSync [20] shows how to bring some benefits of ownership-based reasoning to state-machine approaches, but at a coarse granularity.
Distributed system abstractions. Adore [22] proposes an abstraction for reasoning about reconfiguration for replicated state machine protocols, such as Raft. vKV's primary/backup replication and reconfiguration uses a configuration service to simplify the protocol, but verifies many of the same issues, such as concurrent request execution during reconfiguration.
vKV also handles interactions between reconfiguration and crashes, recovery, leases, and thread-level concurrency, which the Adore abstraction does not directly address.
Protocol reasoning. TLA+ [33,34] provides a modeling language for concisely describing distributed protocols, which can then be model-checked or interactively verified. In other tools, constraining the modeling language used for expressing protocols enables automatic or semi-automatic proofs of correctness, such as ByMC [30], Ivy [38,41], and I4 [36] and its follow-ons. Although protocol verification can ensure the absence of bugs in the protocol design, many bugs in distributed systems only manifest at the level of implementations, and so fall outside the scope of protocol verification. Grove aims to verify implementations of systems to address these bugs.

Conclusion
Grove is a library for verifying distributed systems using concurrent separation logic (CSL). Grove generalizes CSL to support distributed systems with RPCs, leases, replication, reconfiguration, and crash recovery. We demonstrate Grove by implementing and verifying a range of distributed system components, such as primary-backup replication, locking, client caching, and a configuration service. Verifying these components in Grove eliminates broad classes of bugs, and comes with a 12× proof-to-code ratio, in line with previous efforts to verify concurrent and distributed systems. vKV, a key-value store built out of these components, supports primary-backup replication and reconfiguration, achieves 67-73% of the throughput of Redis on a single core, and scales read throughput with more replicas due to its use of leases.

Figure 1: Components verified using Grove as case studies.

Figure 2: Case study components. An arrow A → B means A uses B. Gray components are described only in the extended version of this paper.

Figure 3: A running vRSM system. Double borders represent machines. Arrows represent RPCs. The cloud represents the replicated configuration service. reconfig represents an operator performing reconfiguration.

Figure 9: Implementation of the ReserveEpochAndGetConfig RPC handler in configservice, built on top of paxos. This handler is invoked by the configservice clerk shown in Figure 5.

Figure 11: Interface provided by the vRSM client clerk.

Figure 12: Interface provided by the vKV client clerk.

Figure 14 shows the interface provided by the lock service, built on top of vKV. The lock service uses vKV's conditional-put CondPut() operation to implement locks, with one lock

Figure 14: Interface provided by the lock service.
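As a rough illustration of how a lock can be layered on a conditional put, the following Go sketch acquires a lock by repeatedly attempting to change a key from an "unlocked" value to a "locked" value. Everything named here is an assumption made for illustration and not the verified lock service: the `memKV` stand-in for a vKV clerk, the `LockClerk` type, the boolean result of `CondPut`, and the encoding of unlocked as `""` and locked as `"1"`.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// condPutKV is the part of a key-value interface the lock sketch needs.
// The bool result of CondPut is an assumption of this sketch.
type condPutKV interface {
	Put(k, v string)
	CondPut(k, e, v string) bool
}

// memKV is a hypothetical in-memory stand-in for a vKV clerk.
type memKV struct {
	mu sync.Mutex
	m  map[string]string
}

func (kv *memKV) Put(k, v string) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.m[k] = v
}

// CondPut writes v to k only if the current value equals e.
// A missing key reads as "" (Go's zero value for strings).
func (kv *memKV) CondPut(k, e, v string) bool {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	if kv.m[k] != e {
		return false
	}
	kv.m[k] = v
	return true
}

// LockClerk is a hypothetical lock client: one key per lock name.
type LockClerk struct{ kv condPutKV }

// Acquire spins until it atomically flips the key from "" to "1".
func (l *LockClerk) Acquire(name string) {
	for !l.kv.CondPut(name, "", "1") {
		runtime.Gosched() // let other goroutines run before retrying
	}
}

// Release marks the lock free again.
func (l *LockClerk) Release(name string) { l.kv.Put(name, "") }

func main() {
	lk := &LockClerk{kv: &memKV{m: make(map[string]string)}}
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				lk.Acquire("counter")
				counter++ // protected by the lock
				lk.Release("counter")
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter)
}
```

Mutual exclusion in this sketch rests entirely on the atomicity of `CondPut`: at most one caller can observe the unlocked value and replace it.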
accepted 0 [e] list → ℓ to accepted 0 [e] list → ℓ + [op], and gets the knowledge resource accepted 0 [e] list ⊒ ℓ + [op] before releasing the lock. To release the lock, the proof must give up ownership of accepted 0 [e] list → ℓ + [op], but retains knowledge of accepted 0 [e] list ⊒ ℓ + [op], which will be useful later on. Next, the primary invokes ApplyAsBackupRPC concurrently on all of the backup servers, passing in nextIndex to ensure that op is added to the backup's log at the right position. Using the RPC spec shown at the end of §3.3, the primary gets a promise that the jth backup accepted the operation, in the form of the lower bound resource accepted j+1 [e] list ⊒ ℓ + [op]. The RPC spec is established in a separate proof.
committed list → ℓ + [op]. With this points-to, the proof gets the lower-bound resource committed list ⊒ ℓ + [op], which is needed for the postcondition of Apply.
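The points-to and lower-bound resources above exist only in the proof, but their intuition can be approximated at runtime. The Go sketch below is an analogy under stated assumptions, not part of Grove: the hypothetical `appendOnlyLog` type plays the role of the list, appending requires exclusive access (the mutex), and a snapshot returned by `Append` plays the role of a lower bound. Because the log only grows, any snapshot remains a prefix of the log after later appends, which is why the knowledge can be kept after the lock is released.

```go
package main

import (
	"fmt"
	"sync"
)

// appendOnlyLog is a hypothetical runtime analogue of an append-only list.
type appendOnlyLog struct {
	mu  sync.Mutex
	ops []string
}

// Append adds op and returns a copy of the log at that moment.
// The copy acts like a lower bound: the log will always extend it.
func (l *appendOnlyLog) Append(op string) []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.ops = append(l.ops, op)
	return append([]string(nil), l.ops...)
}

// HasPrefix reports whether snap is a prefix of the current log.
func (l *appendOnlyLog) HasPrefix(snap []string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(snap) > len(l.ops) {
		return false
	}
	for i, op := range snap {
		if l.ops[i] != op {
			return false
		}
	}
	return true
}

func main() {
	log := &appendOnlyLog{}
	bound := log.Append("op1") // knowledge: log extends [op1]
	log.Append("op2")          // later appends do not invalidate it
	fmt.Println(log.HasPrefix(bound))
}
```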

Figure 15: Specification for Begin() in the paxos library. The definition of commitspec is shown in Figure 16.

4. A clerk's Put(k,v), Get(k), and CondPut(k,e,v) functions behave as though accessing a local in-memory key-value map with linearizable operations.
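The reference behavior this property describes can be sketched as a map guarded by a single mutex, so that each operation takes effect atomically at one point in time. The `localKV` type, the string-typed keys and values, the boolean result of `CondPut`, and the convention that a missing key reads as the empty string are all assumptions of this sketch rather than details of vKV; the argument order of `CondPut` (key, expected value, new value) follows the signature CondPut(k,e,v) above.

```go
package main

import (
	"fmt"
	"sync"
)

// localKV is a hypothetical reference model: an in-memory map whose
// operations are made atomic by one mutex.
type localKV struct {
	mu sync.Mutex
	m  map[string]string
}

func newLocalKV() *localKV {
	return &localKV{m: make(map[string]string)}
}

// Put sets k to v.
func (kv *localKV) Put(k, v string) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.m[k] = v
}

// Get returns the value of k; a missing key reads as "" (an assumption).
func (kv *localKV) Get(k string) string {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	return kv.m[k]
}

// CondPut sets k to v only if k currently holds e, and reports
// whether the write happened.
func (kv *localKV) CondPut(k, e, v string) bool {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	if kv.m[k] != e {
		return false
	}
	kv.m[k] = v
	return true
}

func main() {
	kv := newLocalKV()
	kv.Put("x", "1")
	fmt.Println(kv.CondPut("x", "0", "2"), kv.Get("x")) // expected value does not match
	fmt.Println(kv.CondPut("x", "1", "2"), kv.Get("x")) // expected value matches
}
```

A client of the distributed store should not be able to distinguish its observations from some interleaving of calls against a model like this one.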

Figure 17: Lines of Go code and Coq spec/proof for the verified components.

Figure 18: Throughput and latency of vKV compared to Redis.

Figure 19: Throughput over time (averaged over 0.5 second time slices), with the primary crashing at 10 seconds, followed immediately by a reconfiguration to a new primary and backup.

Figure 20: Peak throughput of vKV with increasing number of servers, labeled by the percentage of write operations.

// Reserve a new epoch number for reconfiguration, and
// return the current configuration (set of servers).

// Get a lease for specified epoch, as long as it's the current
// epoch, returning the new lease expiration time.
func (ck *Clerk) GetLease(epoch uint64) (Error, uint64)

1. Ask the configuration service to atomically create a new epoch and return the new epoch's number as well as the latest previous configuration (line 2).
2. Seal a replica server from the previous configuration and fetch its key-value mappings (line 6).
3. Initialize state on all new servers with the state from the old replica, informing them of the new epoch (line 19).
4. Make the new epoch live at the configuration service, by sending it the new configuration (line 26).
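The four reconfiguration steps can be sketched in Go against in-memory stand-ins. This is a simplified illustration under loud assumptions, not the verified code: the `replica` and `configService` types, the `Seal`, `SetState`, and `WriteConfig` methods, and the `reconfigure` function are hypothetical names; only `ReserveEpochAndGetConfig` is taken from the text. The sketch is single-threaded and omits RPCs, crashes, and leases entirely.

```go
package main

import (
	"errors"
	"fmt"
)

var errStale = errors.New("stale epoch")

// replica is a hypothetical in-memory stand-in for a replica server.
type replica struct {
	epoch  uint64
	sealed bool
	kv     map[string]string
}

func copyMap(m map[string]string) map[string]string {
	c := make(map[string]string, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

// Seal moves the replica to the given epoch, stops it from serving,
// and returns a copy of its key-value mappings.
func (r *replica) Seal(epoch uint64) (map[string]string, error) {
	if epoch < r.epoch {
		return nil, errStale
	}
	r.epoch, r.sealed = epoch, true
	return copyMap(r.kv), nil
}

// SetState installs the given state and epoch on a new server.
func (r *replica) SetState(epoch uint64, st map[string]string) error {
	if epoch < r.epoch {
		return errStale
	}
	r.epoch, r.sealed, r.kv = epoch, false, copyMap(st)
	return nil
}

// configService is a hypothetical stand-in for the configuration service.
type configService struct {
	reserved uint64 // highest epoch number handed out so far
	live     uint64 // epoch of the current configuration
	servers  []*replica
}

// ReserveEpochAndGetConfig creates a new epoch and returns it together
// with the latest configuration.
func (c *configService) ReserveEpochAndGetConfig() (uint64, []*replica) {
	c.reserved++
	return c.reserved, c.servers
}

// WriteConfig makes epoch live unless a newer epoch has been reserved.
func (c *configService) WriteConfig(epoch uint64, servers []*replica) error {
	if epoch < c.reserved {
		return errStale
	}
	c.live, c.servers = epoch, servers
	return nil
}

func reconfigure(c *configService, newServers []*replica) error {
	epoch, old := c.ReserveEpochAndGetConfig() // step 1
	if len(old) == 0 {
		return errors.New("no previous configuration")
	}
	state, err := old[0].Seal(epoch) // step 2
	if err != nil {
		return err
	}
	for _, s := range newServers { // step 3
		if err := s.SetState(epoch, state); err != nil {
			return err
		}
	}
	return c.WriteConfig(epoch, newServers) // step 4
}

func main() {
	old := &replica{kv: map[string]string{"k": "v"}}
	c := &configService{servers: []*replica{old}}
	a, b := &replica{}, &replica{}
	fmt.Println(reconfigure(c, []*replica{a, b}), c.live, a.kv["k"], b.kv["k"])
}
```

Reserving the epoch first is what lets the final write detect a competing reconfiguration: if another operator reserved a later epoch in the meantime, `WriteConfig` in this sketch rejects the stale one.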
The second component of the postcondition, commitspec, allows the caller to later attempt to commit a new state by invoking the callback commit, passing the new state as an argument. The definition of commitspec is shown in Figure 16. Its precondition requires that newstate is well-formed according to WF. if err then Paxos → state' else ⌜state' = oldstate⌝ *