LiDO: Linearizable Byzantine Distributed Objects with Refinement-Based Liveness Proofs

Byzantine fault-tolerant state machine replication (SMR) protocols, such as PBFT, HotStuff, and Jolteon, are essential for modern blockchain technologies. However, they are challenging to implement correctly because they have to deal with any unexpected message from byzantine peers and ensure safety and liveness at all times. Many formal frameworks have been developed to verify the safety of SMR implementations, but there is still a gap in the verification of their liveness. Existing liveness proofs are either limited to the network level or do not cover popular partially synchronous protocols. We introduce LiDO, a consensus model that enables the verification of both safety and liveness of implementations through refinement. We observe that current consensus models cannot handle liveness because they do not include a pacemaker state. We show that by adding a pacemaker state to the LiDO model, we can express the liveness properties of SMR protocols as a few safety properties that can be easily verified by refinement proofs. Based on our LiDO model, we provide mechanized safety and liveness proofs for both unpipelined and pipelined Jolteon in Coq. This is the first mechanized liveness proof for a byzantine consensus protocol with non-trivial optimizations such as pipelining.


INTRODUCTION
Byzantine State Machine Replication (SMR) protocols [Schneider 1990], such as PBFT [Castro 2001], HotStuff [Yin et al. 2019], and Jolteon [Gelashvili et al. 2022], form the basis of modern blockchain applications. They ensure that a linear history of a state machine is correctly replicated to a group of nodes and safe from tampering by a minority of malicious nodes. They are also called consensus protocols because a key part of these protocols is to make the participating nodes agree on a single history. As the open nature of public blockchains requires executing consensus protocols on a large number of nodes, in recent years there has been a significant amount of research proposing new protocol designs with better safety and liveness properties [Abspoel et al. 2021; Civit et al. 2022; Lewis-Pye 2022; Naor et al. 2021; Naor and Keidar 2020].
Despite the results that improve various aspects of byzantine SMR, it remains a herculean task to implement these protocols correctly, so that they enjoy the features claimed on paper. A paper description of a protocol is almost never sufficient to specify the behavior of a process under all possible situations. The unspecified aspects are open to interpretation, yet these details can have very subtle effects on the actual system. This issue is especially prominent under byzantine faults since the adversary now has more ways to influence the non-faulty nodes. For a concrete example of this issue, we look at the pacemaker component of SMR protocols. Most SMR protocols are structured as an infinite sequence of smaller protocols called rounds or views, and each node participates in only one round at a time. The pacemaker drives the nodes to a new round when the current round is not making progress. As such, it plays a vital role in maintaining liveness.
The pacemaker usually consists of making each node broadcast a timeout message for its current round when no progress is observed within a certain period, and enter a new round after receiving a quorum of timeouts. In Fig. 1, we show four variants of this simple idea, with subtly different liveness properties. Notice that versions (b) and (d) differ only in allowing mixing timeouts from different rounds. This is significant because it allows non-faulty nodes to keep only the timeout of the highest round from each peer. Without this optimization, byzantine nodes can launch denial-of-service attacks by flooding non-faulty peers with timeouts, a tricky situation that would not occur under benign faults. This shows that paper proofs of protocols are not enough. We need proofs that can be directly tied to the implementation, which can only be achieved by machine-checked proofs on a formal model of the implementation.
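To make the memory-bounding effect of this optimization concrete, here is a small sketch (ours, not from the paper) of the timeout bookkeeping: keeping only the highest-round timeout per peer means a byzantine peer can overwrite its own entry but never grow the state. The names (`TimeoutTracker`, `QUORUM`) and the rule for picking the new round are our illustrative assumptions.

```python
QUORUM = 3  # 2f + 1 with f = 1 (assumed setting for this sketch)

class TimeoutTracker:
    def __init__(self):
        self.highest = {}  # peer id -> highest timeout round seen from that peer

    def on_timeout(self, peer, round_):
        # Retain only the highest round per peer: a flooding byzantine peer
        # can only overwrite its own single entry.
        if round_ >= self.highest.get(peer, 0):
            self.highest[peer] = round_

    def new_round(self, current):
        # Mixing timeouts from different rounds (variants (b)/(d)): enter a
        # new round once a quorum of peers has timed out at or beyond ours.
        beyond = sorted(r for r in self.highest.values() if r >= current)
        if len(beyond) >= QUORUM:
            return beyond[-QUORUM] + 1  # highest round the whole quorum reached
        return None
```

Even if a byzantine peer sends thousands of timeouts, the tracker's state stays bounded by the number of peers.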
Today, we have many formal frameworks for verifying the safety properties of distributed systems [Krogh-Jespersen et al. 2020; O'Hearn 2007; Sergey et al. 2017; Sharma et al. 2023]. In particular, formal safety proofs of consensus protocols have been studied in Carr et al. [2022]; Cirisci et al. [2023]; Honoré et al. [2021, 2022]; Rahli et al. [2018]; Taube et al. [2018]; Wilcox et al. [2015]. However, there are very few works that also establish formal liveness results for consensus protocols. IronFleet [Hawblitzel et al. 2015] and PSync [Drăgoi et al. 2016] establish liveness for benign-fault protocols such as Multi-Paxos [Lamport 1998], but do not handle byzantine faults. Padon et al. [2017] proposes a liveness-to-safety reduction approach for proving liveness, which has been applied to byzantine-fault protocols in Berkovits et al. [2019]; Losa and Dodds [2020], but their methodology has never been applied to partially synchronous protocols, the most common class of consensus protocols in practice.
Finally, none of the existing machine-checked liveness proofs are based on refinement, whereas safety results are usually stated as a refinement between the network model and the abstract interface. This situation is unsatisfactory for several reasons. Most importantly, it obscures high-level reasoning and prevents proof reuse since every definition and every lemma is tied to network-level details. It also poses a challenge to users of the system, since they must understand the implementation details of the system in order to understand what is proved as "liveness." In this work, we aim to simplify the task of constructing liveness proofs of byzantine consensus protocols. We achieve this by introducing an intermediate model of consensus between the SMR interface and network details that supports proving both safety and liveness via refinement. Our key insight is that existing models lack a representation of the pacemaker, which, as we have seen, is critical to liveness, and this prevents them from handling liveness. We start from the Atomic Distributed Object (ADO) model, which has been used to verify the safety of several benign-fault protocols [Honoré et al. 2021, 2022]. We show that by adding the pacemaker state into the model, the liveness of consensus can be reduced to a few safety properties on timed traces, which can be easily proved through refinement. We also introduce segmented traces, a variant of timed traces that enables more effective formalization of liveness properties and proofs. Using our LiDO model, we obtain safety and liveness proofs for both unpipelined and pipelined Jolteon [Gelashvili et al. 2022].
To summarize, our contributions are:
• LiDO, a model of consensus formalized in Coq, supporting both safety and liveness reasoning via refinement;
• Segmented traces, an effective formalism for proving liveness properties via refinement;
• Implementations of both unpipelined and pipelined Jolteon in Coq, providing case studies for our methodology;
• Refinement-based proofs of both safety and liveness of Jolteon using our LiDO model.
All proofs have been mechanized in Coq and are available as artifacts [Qiu et al. 2024a]. We have also extracted unpipelined Jolteon into an OCaml executable, showing that our network model is reasonable and realistic. For additional details, see the appendices A, B, C, and D, which are available in the extended technical report [Qiu et al. 2024b].

Background: State Machine Replication Under Partial Synchrony
The Partial Synchrony Assumption. Message-passing distributed systems rely on getting messages delivered to make progress. Therefore, any liveness property of such systems depends on assumptions about message delivery. Depending on the kind of assumptions they make, systems are classified as asynchronous, synchronous, or partially synchronous [Dwork et al. 1988]. This work targets proving the safety and liveness of partially synchronous SMR protocols. Our theory can be applied to both benign-fault and byzantine-fault tolerant protocols, but in this work, we mainly consider Byzantine fault-tolerant (BFT) protocols.
There are two versions of partial synchrony [Dwork et al. 1988]. In one version, there is a fixed upper bound Δ on message delivery latency, but it only holds after a certain timepoint called the global synchronization time (GST). The participating processes know Δ but do not know when GST commences. In the other version, the delivery latency is always bounded, but the processes do not know the exact bound. In this work, we use the first version, as it is easier to work with formally.
State Machine Replication. The safety definition of SMR is well-known. Clients submit requests to the system. The system outputs responses to requests, each request is responded to at most once, and the request-response trace must linearize to an atomic spec of a state machine [Herlihy and Wing 1990]. Under partial synchrony, the system processes do not know when GST begins, so they cannot rely on messages being delivered in time. As such, they must maintain safety even during periods of asynchrony.
The liveness definition of SMR is more subtle. There can be two reasonable definitions:

Definition 2.1. An SMR system is live if every client request is eventually responded to.
Definition 2.2.An SMR system is live if it responds to client requests infinitely often.
There is a gap between the two definitions. Under Definition 2.2, the system may selectively respond to a subset of requests. When there is only a fixed number of clients, we can simply make the system choose among client requests in a round-robin fashion, so that each client is fairly serviced. When the SMR clients are unbounded in number, as in public blockchains, maintaining fairness among clients is a research problem that is beyond the scope of this work [Kelkar et al. 2020; Kursawe 2020]. Therefore, this work aims to establish Definition 2.2.

The ADO Model of Consensus
Although the safety and liveness definitions of SMR are simple and intuitive, proving that an implementation satisfies these definitions is not. As one of the early attempts, Verdi [Wilcox et al. 2015; Woos et al. 2016] used 50,000 lines of code to prove the safety of Raft but did not verify liveness. Proofs of this complexity are difficult to maintain and difficult to port to other implementations.
To better manage the complexity of proofs, a successful strategy is to introduce an intermediate abstraction between SMR and the network model. The abstraction captures essential information about the network state but remains simple enough to allow easy reasoning about system behavior. Most notably, the Atomic Distributed Object (ADO) theory has been proposed to verify the safety of multiple benign-fault consensus protocols, including Raft with reconfiguration [Honoré et al. 2021, 2022]. Here we give an intuitive introduction to ADO. The formal details are in Section 3.
The ADO theory gives a detailed view of how the consensus log grows during the execution of a consensus protocol. It models the execution of a protocol as a group of proposer processes interacting with a concurrent object. The basic idea of ADO is that within each round of consensus, three events occur in sequence: first, the leader is given an up-to-date branch of the consensus log; then, it appends one or more requests at the tip of the branch; finally, it attempts to commit its changes. These steps are called pull, invoke, and push. In each step the leader may collect enough votes and succeed, or it may fail to collect votes before the pacemaker drives it to the next round.
To model this behavior, the ADO object exposes three operations pull(r), invoke(r, m), and push(r), where r is the round the proposer participates in and m is the request (called a method in ADO theory) the proposer wishes to append. The object responds to each call with either OK or FAIL. When it responds to a call, a cache node is created to record information about the response. Successful responses to pull, invoke, and push correspond to ECache, MCache, and CCache respectively, where E, M, C stand for Election, Method invocation, and Commit. The cache nodes are chained together by causal relation to form a cache tree.
Fig. 2 shows an example cache tree. Each MCache represents a client method that has been proposed by a proposer. We say an MCache is committed if there exists a path from that MCache to a CCache. The safety property of an ADO object is an invariant of the cache tree: there exists a path from the root that contains all committed MCaches. It follows that we can take the sequence of all committed MCaches on this path as the consensus log, and implement SMR on top of it, as was done in Honoré et al. [2021, 2022]. The linearization point of each client method is the point where the corresponding MCache becomes committed. Hence, SMR liveness is equivalent to creating MCaches and CCaches infinitely often in the ADO cache tree.

Fig. 2. An MCache is committed if there exists a path from it to a CCache. Hence the MCaches of rounds 1 and 3 are committed, but the MCache of round 2 is not.
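The cache-tree notions above (committed MCaches, and the consensus log read off a root path) can be sketched in a few lines. This is our own simplified rendering, not the paper's Coq definitions; the class and field names are assumptions.

```python
class Cache:
    def __init__(self, kind, round_, parent=None, method=None):
        # kind is 'Root', 'E', 'M', or 'C'; parent links form the cache tree
        self.kind, self.round, self.parent, self.method = kind, round_, parent, method

ROOT = Cache('Root', 0)

def path_to_root(node):
    out = []
    while node is not None:
        out.append(node)
        node = node.parent
    return out  # node, ..., ROOT

def committed_log(caches):
    # Take the CCache with the highest round; every MCache on its ancestor
    # path is committed, and their sequence is the consensus log.
    ccaches = [c for c in caches if c.kind == 'C']
    if not ccaches:
        return []
    tip = max(ccaches, key=lambda c: c.round)
    return [c.method for c in reversed(path_to_root(tip)) if c.kind == 'M']
```

Mirroring Fig. 2: if rounds 1 and 3 have CCaches but round 2 branches off without one, round 2's method is excluded from the committed log.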

The Need for a New Model
The ADO model nicely abstracts out the common logic of safety proofs, but it has not been useful for verifying liveness. We now look at why proving liveness remains difficult. Intuitively, a refinement-based liveness proof should involve the following steps:
(1) Among the valid traces of the abstract model, we identify a subset of live traces;
(2) We prove that all live traces of the abstract model satisfy SMR liveness (Definition 2.2);
(3) We identify the live traces of the implementation;
(4) We prove that every live trace of the implementation refines a live trace of the model.
Clearly the key part of this plan is the first step. We need to carefully define the model and its live traces so that it is both easy to prove that every live trace satisfies SMR liveness, and easy to prove that an implementation refines the live traces.
Temporal properties are easiest to work with when they are posed as safety properties. That is, they concern system dynamics over only a finite period of time. For example, "the system commits client methods infinitely often" is a liveness property, but "the system will commit one client method within each period of 10Δ" is a safety property. Ideally, when we define the live traces of the abstract model and the network model, we should always characterize them using safety properties.
We now try to execute the plan on the ADO model. As discussed above, we have to create a CCache infinitely often. To reduce this to a safety property, naively, we may try:

Example 2.3. After GST, when a non-faulty proposer calls push(r), it receives OK within 2Δ.
If we could prove this, and we arrange non-faulty proposers to call push infinitely often, we could show that a CCache is created infinitely often. At first glance, this seems intuitive. At the network level, a push call from a non-faulty proposer usually corresponds to it broadcasting a request message. Since the message will be delivered to every non-faulty voter within Δ, and the votes will come back within Δ, the request will succeed within 2Δ.
However, in making this inference we have neglected an important factor: the pacemaker. By influencing the round-change process, the adversary may obstruct liveness in a number of ways:
• The non-faulty nodes may never enter round r, so the leader may never make its request;
• The byzantine nodes may initiate a round-change before the request succeeds, so the leader will not receive a response.
Even if byzantine nodes do nothing, the non-faulty nodes will still initiate a round-change when their timers expire. Therefore, to prove that a request will succeed, we at least need to assume that all non-faulty nodes still have sufficient time in their timers. Unfortunately, the ADO model does not capture information about these local timers, preventing us from formally expressing this assumption. This clearly shows that we need a new model that incorporates the pacemaker information we need. This motivates us to propose the LiDO model. As shown in Fig. 3, we add two state variables: the current round (round) and the remaining time (rem_time for short). These variables represent a logical timer: round represents the round the voters currently guarantee liveness for, while rem_time is the least amount of time the voters promise not to time out. In this work, we will consistently use Δ as a unit of time, so rem_time = 3 intuitively means none of the non-faulty voters will time out within 3Δ.
The timer variables can only be manipulated through a number of calls, shown in Fig. 3. In particular, tick() represents the flow of time: it decreases rem_time by 1. The call advance() increases round by 1, but it can only be called when rem_time = 0. This says that the adversary cannot terminate a round until the logical timer has expired. The adversary also cannot make time flow too fast. We capture this by allowing tick() to be called at most once per period of Δ. Hence if rem_time = 3, then the adversary must call tick() three times, taking a period of 3Δ, before it may increase round. Our safety rules on tick() and advance() thus formalize the notion that the adversary cannot preempt a non-faulty leader too soon. We also allow the leader of round r to call finish(r), a request to start round r + 1, after all requests in round r have succeeded. More formal details are in Section 3.2.
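The logical-timer rules above can be sketched as a tiny transition system. This is our own Python rendering, not the paper's Coq model; the operation names (tick/advance/finish), the budget constant, and the choice to let the finish signal take effect immediately are illustrative assumptions.

```python
REM_INIT = 3  # preconfigured per-round budget, in units of Δ (assumed value)

class Pacemaker:
    def __init__(self):
        self.round = 1
        self.rem_time = REM_INIT
        self.finished = set()  # rounds whose leader has signaled completion

    def tick(self):
        # Flow of time: consumes one Δ of budget; callable at most once per Δ.
        assert self.rem_time > 0
        self.rem_time -= 1

    def advance(self):
        # The adversary may end the round only after the logical timer expires.
        assert self.rem_time == 0
        self.round += 1
        self.rem_time = REM_INIT

    def finish(self, r):
        # Leader of round r requests starting r + 1 without waiting for expiry.
        assert r == self.round
        self.finished.add(r)
        self.round += 1
        self.rem_time = REM_INIT
```

The assertions are exactly the safety rules: with a fresh budget of 3, the adversary must spend three ticks (3Δ of real time) before it may preempt the leader via advance().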
With pacemaker information added to the model, we can weaken Example 2.3 to:

Example 2.4. After GST, if round = r and rem_time ≥ 2, and the leader of round r is non-faulty and calls push(r), then it receives OK within 2Δ.
This is now intuitively implementable because the pacemaker will not intervene within 2Δ.

Proving Liveness Under Partial Synchrony
In the previous sections we gave an intuitive explanation of our LiDO model design but did not give any formal liveness details.We now introduce our liveness formalism.
In synchronous or partially synchronous systems, the most general formalism for characterizing live traces is timed traces [Lamport 2005]. We first assume there exists a variable representing physical time that increases continuously. When each event occurs, we pair it with the current reading of this variable. The trace thus consists of a sequence of (event, timepoint) pairs. The problem is that continuous time is difficult to encode in proof checkers. We can look at timed traces in a different way. In general, we only consider non-Zeno timed traces, meaning only a finite number of events may occur within a finite period of time. Hence the set of all events occurring before any timepoint is always finite. Then we can cut the trace into segments, each representing a period of Δ, and use them to cover the entire trace. Formally:

Definition 2.5. A segmented trace is an arbitrary-length sequence of finite untimed traces (σ_0, σ_1, · · ·), such that each σ_i is a valid trace, and each σ_i is a prefix of σ_{i+1}.
Definition 2.6. Let T be any timepoint with T ≥ GST. We define the (T, Δ)-segmentation of a non-Zeno timed trace to be (σ_0, σ_1, · · ·), where σ_i is the sequence of all events that occurred at some timepoint t < T + iΔ. If the timed trace is infinite, the segmentation is also infinite; if it is finite and only covers events up to timepoint T + NΔ, the segmentation also ends at σ_N.
In Definition 2.6, the requirement that T ≥ GST ensures that all events that occurred before GST are hidden inside σ_0, so we do not need to worry about whether each segment between σ_i and σ_{i+1} falls before or after GST.
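For finite traces, the (T, Δ)-segmentation of Definition 2.6 can be computed directly; the sketch below (event/time pairs, an explicit horizon) is our own encoding, and the prefix property of Definition 2.5 falls out by construction.

```python
def segmentation(timed_trace, T, delta, horizon):
    # timed_trace: list of (event, time) pairs, sorted by time.
    # Segment i collects every event that occurred at time t < T + i*delta,
    # so segment 0 absorbs everything before T (in particular, before GST).
    segs = []
    for i in range(horizon):
        segs.append([e for (e, t) in timed_trace if t < T + i * delta])
    return segs
```

Because each segment is a time-prefix of the next, each σ_i is automatically a prefix of σ_{i+1}.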
Segmented traces provide a convenient formalism for stating temporal properties with time constraints. To see this, we look at formalizing the partial synchrony assumption in a network model. In this work, the formal definition of partial synchrony under timed traces is:

Assumption 2.7. If process p sends message m to process q at timepoint t, and both p, q are non-faulty, then process q receives m at least once in the interval [t, max(t, GST) + Δ].

For a valid trace σ, let sent(σ) be the set of all messages already sent within σ, and let delivered(σ) be the set of delivered messages, represented as (process, message) pairs. For each message m, let sender(m) be its sender and dest(m) its recipient set. Let H be the set of non-faulty processes. Then we have:

Lemma 2.8. In every (T, Δ)-segmentation (σ_0, σ_1, · · ·) of every live timed trace of a network model, for every i, every m ∈ sent(σ_i) with sender(m) ∈ H, and every q ∈ dest(m) ∩ H, we have (q, m) ∈ delivered(σ_{i+1}).

Proof: If m was sent before GST, then it is delivered at least once before GST + Δ ≤ T + Δ. If it was sent at some timepoint in [GST, T + iΔ), then it is delivered at least once before T + (i + 1)Δ. In either case, the delivery is contained in σ_{i+1}.

Thus we can take Lemma 2.8 as the definition of partial synchrony under segmented traces.
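The segment-level delivery property of Lemma 2.8 is easy to check mechanically on a finite segmented trace. The encoding below (event tuples for sends and receives) is our own illustration, not the paper's formalization.

```python
def check_delivery(segs, honest, sender, dest):
    # segs: list of traces; a trace is a list of ('send', m) or ('recv', p, m).
    # Checks: every message by a non-faulty sender appearing in segment i is
    # delivered to every non-faulty recipient by segment i + 1.
    def sent(trace):
        return {ev[1] for ev in trace if ev[0] == 'send'}
    def delivered(trace):
        return {(ev[1], ev[2]) for ev in trace if ev[0] == 'recv'}
    for i in range(len(segs) - 1):
        for m in sent(segs[i]):
            if sender(m) not in honest:
                continue  # no delivery guarantee for byzantine senders
            for q in dest(m) & honest:
                if (q, m) not in delivered(segs[i + 1]):
                    return False
    return True
```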
Refinement of Segmented Traces. We now observe that segmented traces enjoy a natural notion of refinement. Let f be a refinement mapping between a spec system and an implementation system. Then f maps traces of the implementation to traces of the spec. Furthermore, the definition of a refinement mapping requires that, if σ is a prefix of σ′, then f(σ) is also a prefix of f(σ′).
It follows that, if (σ_0, σ_1, · · ·) is a valid segmented trace of the implementation, then (f(σ_0), f(σ_1), · · ·) is also a valid segmented trace of the spec. We say that (σ_0, σ_1, · · ·) refines the segmented trace (f(σ_0), f(σ_1), · · ·). Thus, segmented traces are a very convenient formalism for analyzing partially synchronous systems. They are easy to encode in proof checkers, and it is easy to define refinement between them. Throughout this work, we will use segmented traces as the main formalism for analyzing liveness. Our plan consists of the following steps:
(1) We specify the LiDO model of consensus as our spec (Section 3);
(2) We specify a set L of segmented traces as the live traces of the LiDO model, and prove that they satisfy SMR liveness (Section 3.3);
(3) We specify a system model for implementing unpipelined and pipelined Jolteon (Section 4.1);
(4) We define a set L′ of segmented traces as live traces of the implementation, and show that L′ covers all timed traces of the implementation (Definition 4.3);
(5) We establish a refinement mapping between the implementations and LiDO, and prove that every trace in L′ refines a trace in L (Section 4.3).
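The lifting just described is purely pointwise, which is why it avoids all reasoning about real time. A minimal sketch (ours): any prefix-preserving map on finite traces sends segmented traces to segmented traces.

```python
def is_segmented(segs):
    # Definition 2.5: each element must be a prefix of the next.
    return all(segs[i] == segs[i + 1][:len(segs[i])] for i in range(len(segs) - 1))

def lift(f, segs):
    # Apply a refinement mapping f on finite traces pointwise to each segment.
    return [f(s) for s in segs]
```

For instance, a mapping that erases implementation-internal events is prefix-preserving, so its pointwise lift of a segmented trace is again a segmented trace.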
Layered Refinement Proof. Fig. 4 shows a schematic diagram of our refinement proof between LiDO and Jolteon. Between LiDO and the network model, we use two additional layers, called the Server layer and the Voting layer. These layers implement the LiDO model in a shared-memory manner. This allows us to focus on the important invariants maintained by the protocol while ignoring details such as message delivery and bookkeeping, which only appear in the network model. Introducing these intermediate layers simplifies the overall engineering effort. See Section 4 for details.

THE LIDO MODEL OF CONSENSUS
In this section, we formally define the LiDO model and its live traces. We first define the ADO model as a concurrent object, and then define the LiDO model as an extension of ADO.

The ADO Model
Algorithm 1 The Method Pool Object
1: initialize: M ← ∅
2: upon pushMethod(m): M ← M ∪ {m}
3: upon checkMethod(m): return m ∈ M

In byzantine consensus, although byzantine proposers can send any message, the system is required to maintain external validity, meaning all requests committed in the log must come from external clients, not fabricated by byzantine proposers. Therefore, we first assume there exists an object called the method pool (Algorithm 1). The object state M is a set of client-signed methods. Initially, M = ∅. The object exposes two operations pushMethod(m) and checkMethod(m). SMR clients may call pushMethod(m) to add a method m into M. The ADO object may call checkMethod(m) to check whether m has been registered or not. Both calls are atomic. In an implementation, they correspond to the client signing a request, and the voters checking the signature.
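The method pool transliterates almost directly into code; this sketch (ours) elides the cryptographic signing that an implementation would perform.

```python
class MethodPool:
    """A grow-only set of client-signed methods (Algorithm 1, signing elided)."""
    def __init__(self):
        self.pool = set()       # M <- empty set

    def push_method(self, m):   # called by SMR clients to register a request
        self.pool.add(m)

    def check_method(self, m):  # called by the ADO object to validate a request
        return m in self.pool
```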
The ADO object proper is a concurrent object, formalized as a transition system consisting of two kinds of events: an agent making a call on the object, and the object responding to a call. The object does not need to respond to each call immediately; it may respond at some arbitrary later time, but no changes to the object state occur before the response. We assume each agent is sequential: it only waits upon one call at a time. Thus in a valid trace, each agent alternates between making a call and receiving its response, and the events of different agents can be interleaved.
The agents interacting with the ADO object are the proposers of a consensus protocol. In the standard setting, there are 3f + 1 proposers, of which 2f + 1 are non-faulty and f are byzantine. We assume that in the consensus protocol, rounds are numbered from 1, and in each round, one of the proposers is predetermined as the leader. We use leader(r) to represent the leader of round r. Assume that each proposer becomes a leader infinitely often.
The object state of ADO is a set Σ of cache nodes that form a cache tree. Therefore, we first formally define cache nodes and the cache tree, then define the ADO object. The cache node structure is defined in Fig. 5 (a). Each cache node except the root contains a round field, along with other data. Let Σ be a set of cache nodes with at most one ECache, one MCache, and one CCache per round. We use Σ[r].ECache to represent the unique ECache of round r, and write Σ[r].ECache = ⊥ if round r does not have an ECache; similarly for the other notations. For each ECache, MCache, and CCache in Σ, we define its parent as in Fig. 5 (b). The cache tree of Σ is a graph with the cache nodes as its vertices, and directed edges from each node to its direct children as its edges. The cache tree is well-defined when the parent of each node is well-defined, and the graph forms a rooted tree with the root cache as its root. See Fig. 2 for an example cache tree.

Fig. 6. Preconditions for response of ADO object. For explanations, see Appendix A.

We say an MCache is committed if there exists a path in the cache tree from that MCache to a CCache. Hence in Fig. 2, the MCaches of rounds 1 and 3 are committed, but the MCache of round 2 is not. When the cache tree is well-defined, there is a unique path from the root to each cache node c. The sequence of all MCaches on that path forms the consensus log up to node c. We denote it by log(c). We now define the ADO object. The object state is a set Σ of cache nodes. Initially, Σ contains only the root cache. The object exposes three operations, pull(r), invoke(r, m), and push(r), with r > 0. Only leader(r) may call pull(r), invoke(r, m), or push(r). Also, invoke(r, m) can only be called when m has been registered in the method pool. Otherwise, it is considered an invalid call and fails immediately with no change in object state.
The object may respond to each call with either OK or FAIL. If the caller is byzantine, it may voluntarily stop waiting for its current call; we represent this with a special abort response, with no change in object state. The object may always respond to a call with FAIL, adding a failure cache node to Σ. When it decides to respond to pull(r) with OK, it must non-deterministically choose a parent cache c such that the pull precondition on (Σ, r, c) in Fig. 6 is currently satisfied. Similarly, when it responds to invoke(r, m) or push(r) with OK, the corresponding invoke or push precondition in Fig. 6 must be currently satisfied. The changes to the object state upon each response are defined in Algorithm 2.

Linearizability of ADO. In Honoré et al. [2021, 2022], the ADO object was described as an atomic object. The refinement proof works by reordering network events to their linearization points. However, for liveness refinement, we have to define a refinement mapping between ADO and the network model, and events cannot be reordered, which forces us to switch to a concurrent spec.
Nevertheless, we can define an atomic version of ADO as follows. The object exposes exactly the same interface as the concurrent ADO. However, when any proposer makes a call, the object atomically chooses a response (OK or FAIL) and returns it immediately. The preconditions and effects of each response are exactly the same as defined in Fig. 6 and Algorithm 2.
Lemma 3.1.The concurrent ADO object is linearizable to the atomic ADO object.
Proof: We simply choose the point where the object generates a response as the linearization point of a call. The call does not have any effect on the object state before the response is generated. The preconditions for generating a response depend only on the object state at the response point. Therefore, moving every call event to the response point results in a valid atomic trace with the same final object state.

Safety of ADO.
In Appendix A, we give a presentation of the safety theory of ADO. Here we simply note that the ADO cache tree is always well-defined, and there is always a path starting from the root that contains all committed MCaches. Let c be the committed MCache with the highest round number; then we can take log(c) to be the current committed consensus log.
Implementing ADO. We also define what it means for a network system with byzantine processes to implement the ADO object. Let the message space of the system be the set of all possible messages that may be created within the system. For every reachable system state s, let created(s) denote the set of all messages that have actually been created at state s. Then we define:

Definition 3.2. A refinement between the ADO object and a network system consists of the following data: (1) A refinement mapping f that maps valid finite network traces to valid finite traces of the ADO object, which defines a correspondence between network states s and ADO cache trees Σ; (2) For each possible cache node c, a certificate set cert(c) included in the message space, such that in every corresponding pair of network state s and ADO cache tree Σ, we have c ∈ Σ iff at least one member of cert(c) is in created(s).
Although byzantine processes can send any message, they still have to follow cryptographic restrictions, which is why they cannot fabricate messages in cert(c) to claim the existence of a cache node c. Thus, whatever byzantine processes do in the network system, the net effect is still as if they were following the ADO interface. Hence external processes such as SMR clients and executors can use CCache certificate messages as evidence that a method is committed, and act accordingly.

The LiDO Model
We now define the LiDO model, which is the ADO model extended with state variables and operations that represent an abstract pacemaker.
As shown in Fig. 3, the LiDO object state consists of an ADO cache tree Σ and two integers round and rem_time. The object exposes six operations. The pull(r), invoke(r, m), and push(r) operations affect the cache tree, and their semantics are exactly the same as in the ADO object (Algorithm 2). There are three new operations, finish(r), tick(), and advance(), which affect the pacemaker state and are described below.
We introduce a new agent called the adversary A, which represents the effect of time flowing. A may call tick(), which decreases rem_time by 1. When rem_time = 0, A may call advance() to increase round by 1 and reset rem_time to a preconfigured initial value. This models a logical timer that is simulated by the local timers of each voter. It allocates a fixed duration for each round, and when the timer expires, the pacemaker may intervene to start the next round. Both tick() and advance() are atomic calls: the object responds to the call immediately.
We allow leader(r) to call finish(r). This call sends a signal to the pacemaker that it may start round r + 1 without waiting for the timer of round r to expire. This is a concurrent call: the object does not need to respond to the signal immediately.
The formal effects of these calls are shown in Algorithm 3.

The Live Traces of LiDO
We now define the live traces of LiDO. In general, we define live traces by liveness requirements.
A valid segmented trace (σ_0, σ_1, · · ·) is a live trace whenever it satisfies these requirements. Our requirements only concern events over a fixed-length duration. This makes them safety properties, which are easy to handle using refinement.
The liveness requirements on LiDO are divided into protocol-independent ones and protocol-dependent ones. The reason is that pipelined protocols provide a weaker liveness guarantee, since committing a method needs the cooperation of two (or more) leaders, so certain liveness properties of unpipelined protocols are not enjoyed by pipelined ones. Here we focus on unpipelined protocols. The liveness of pipelined protocols will be discussed in Section 5.
Definition 3.3. The protocol-independent liveness requirements are:
(1) Between σ_i and σ_{i+1}, tick() is called at most once;
(2) If rem_time(σ_i) > 0, then between σ_i and σ_{i+1}, if tick() is not called then round is increased at least once;
(3) There exists a constant C, such that if rem_time(σ_i) = 0, then round(σ_{i+C}) > round(σ_i).
By "between σ_i and σ_{i+1}," we mean the trace σ_{i+1} with the prefix σ_i removed. Together, these requirements ensure that the round number will increase unboundedly in an infinite trace, while still giving each round sufficient time.
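Because each requirement only inspects a bounded window of segments, it can be checked mechanically on finite segmented traces. The sketch below is our own encoding (event tuples, and round/rem_time read off as functions of the segment index), not the paper's Coq statement.

```python
def new_events(segs, i):
    # Events "between σ_i and σ_{i+1}": σ_{i+1} with the prefix σ_i removed.
    return segs[i + 1][len(segs[i]):]

def check_liveness(segs, round_, rem_time, C):
    n = len(segs) - 1
    for i in range(n):
        evs = new_events(segs, i)
        # (1) tick() at most once per Δ-wide segment
        if evs.count(('tick',)) > 1:
            return False
        # (2) while the budget is positive: either time flows or the round moves
        if rem_time(i) > 0 and ('tick',) not in evs and round_(i + 1) <= round_(i):
            return False
        # (3) once the budget is exhausted: round advances within C segments
        if rem_time(i) == 0 and i + C <= n and round_(i + C) <= round_(i):
            return False
    return True
```

A trace in which nothing ever happens while rem_time stays positive violates requirement (2), capturing the intuition that the adversary cannot freeze time.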
Many protocols allow two consecutive non-faulty leaders to cooperate to start a new round without waiting for the timer to expire. This feature can be formulated as an additional liveness requirement: if the leader of round r is non-faulty, there is a CCache of round r by the end of σ_{i+1+c}. On the other hand, if the round changes earlier, then the leader of round r must have called the signal operation, so there exists a CCache of round r by the end of σ_{i+1+c} as well. Given that each non-faulty proposer becomes a leader infinitely often, this implies that CCaches are created infinitely often during execution. Since each CCache represents a new method being committed, we see that new methods are committed infinitely often.

PROVING SAFETY AND LIVENESS OF UNPIPELINED JOLTEON
In Section 3 we fully defined the LiDO object and its live traces. In this section we use unpipelined Jolteon as an example to show how to prove that a network model refines LiDO. We study Jolteon because its design maps nicely onto the ADO's three-step view of consensus. We consider a system consisting of a fixed finite set of non-faulty and byzantine processes. The only way of communication between these processes is through sending and delivering messages. The set of all messages that may potentially be created within the system is called its message space. The set of all internal states each non-faulty process may potentially reach during execution is called its state space. Within the state space, a special state s_0 is designated as the initial state of each non-faulty process. We do not model the internal state of byzantine processes.

System Model
The system state consists of three parts: 1) a finite map from process IDs to the current state of each process; 2) a finite set of all messages that have been created within the system; 3) a finite set of process-message pairs, indicating which messages have been delivered to which processes. Initially, every process is in the initial state s_0, and both the message set and the delivered set are empty. Fig. 7 shows the architecture of each non-faulty process. It is specified as a main process with a timer object attached. The main process has three operations: it can receive requests from external clients, receive messages from the network, and receive timeout signals. Only the timer can send timeout signals to the main process. The timer object has two operations called reset and elapse, where reset can only be called by the main process while elapse is an external signal. The formal details of the timer are explained later.
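A minimal sketch of this state triple (the record and field names are illustrative; the formal definitions live in the Coq development):

```python
# Hedged sketch of the system state: per-process local states, the set of
# messages ever created, and the delivered (process, message) pairs.

INIT_STATE = "s0"   # placeholder for the designated initial state

def initial_system(process_ids):
    return {
        "local_state": {p: INIT_STATE for p in process_ids},
        "messages": set(),      # all messages created within the system
        "delivered": set(),     # (process, message) pairs
    }

def deliver(sys, p, msg):
    # A message can only be delivered if it was created previously.
    assert msg in sys["messages"], "cannot deliver a message never sent"
    sys["delivered"].add((p, msg))

sys = initial_system([0, 1, 2, 3])
sys["messages"].add("m1")
deliver(sys, 2, "m1")
assert (2, "m1") in sys["delivered"]
```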
Each event that may occur within the system belongs to one of the following kinds: (1) Deliver an external client request to a non-faulty process; (2) Deliver a message to a non-faulty process, provided it has been sent previously; (3) Deliver a time-elapse signal to a timer object; (4) A byzantine process sends an arbitrary message, subject to constraints (explained later). The action of a non-faulty process upon each delivery event is specified by a handler function. The action may involve state changes and sending messages and is atomic with the event.
The Timer Object. We now study the timer object more closely. Normally, a timer is considered a continuous object that exposes a single operation reset and sends out timeout signals. After GST, the timer sends out a timeout signal when and only when a predetermined duration has elapsed since the most recent reset call. This model is intuitive and is implicitly adopted in paper proofs of liveness such as Bravo et al. [2022]. However, continuous objects are difficult to formalize in proof checkers. Therefore, in this work, we replace it with a discrete model that approximates the continuous behavior. We explain how this model is derived. Without loss of generality, let us assume that the timeout duration is kΔ, where k is a positive integer.
Let u be the time duration elapsed since the most recent reset call and t = kΔ − u. Then t is a continuous variable that decreases linearly as time flows. When t reaches 0, the timer sends out a timeout signal. Now instead of focusing on t, we consider its approximate value v = ⌊t/Δ⌋.
We observe that v only changes discretely. If reset is called at timepoint t_0, then before timepoint t_0 + Δ we have v = k − 1, and v decreases by 1 at timepoints t_0 + Δ, t_0 + 2Δ, · · ·. The timer sends out its timeout signal at timepoint t_0 + kΔ.
We can picture the discrete changes of v as being triggered by an external elapse signal. Formally, we make the timer maintain an internal counter. When reset is called, the counter is set to k − 1. When the timer receives an elapse signal, it decreases the counter by 1. After the counter reaches 0 and the elapse signal is received again, the timer delivers a timeout. Algorithm 4 (The Discrete Timer Model) shows the formal pseudocode.
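A sketch of the discrete timer, assuming the counter-based reading of Algorithm 4 described above (names are illustrative):

```python
# Hedged sketch of the discrete timer with timeout duration k * Δ.
# reset sets the counter to k - 1; each elapse signal decrements it;
# the elapse received after it reaches 0 delivers the timeout, i.e.
# the k-th elapse after a reset, matching the continuous t0 + kΔ.

class DiscreteTimer:
    def __init__(self, k):
        self.k = k
        self.remaining = k - 1   # the variable written v in the text

    def reset(self):
        self.remaining = self.k - 1

    def elapse(self):
        # Returns True exactly when the timeout fires.
        if self.remaining == 0:
            return True
        self.remaining -= 1
        return False

timer = DiscreteTimer(k=3)
fired = [timer.elapse() for _ in range(3)]
assert fired == [False, False, True]   # timeout on the k-th elapse
timer.reset()
assert timer.remaining == 2
```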
To use this model in liveness proofs, we also have to specify its live traces. We first formally characterize live traces of a timer using timed traces.

Fig. 8. Message space of Jolteon.

Unpipelined Jolteon
In Gelashvili et al. [2022], the Jolteon consensus protocol was described in its pipelined form. Here we consider an unpipelined form. The pipelined form is considered in Section 5 and Appendix D.
Message Space. As shown in Fig. 8, the message space of Jolteon consists of five kinds of messages: requests, votes, cache certificates, timeouts, and timeout certificates. Requests, votes, and cache certificates are subdivided into an M type and a C type. This naming shows the correlation between Jolteon and the ADO model. In Jolteon, the ECache and MCache are created simultaneously in a single phase, and the M-type cache certificate serves as the certificate for both; the CCache is created in a second phase, using the C-type certificate. In the original description [Gelashvili et al. 2022], the two kinds of cache certificates are called QC and CommitQC respectively.
Constraints on Byzantine Processes. In theory, byzantine processes are allowed to send "any message." However, it is common practice to use cryptographic primitives such as digital signatures to constrain their behaviors. We impose two constraints on byzantine processes, called the cryptographic constraint and the semantic constraint. As shown in Fig. 8, each message (with one exception) contains a sender ID field. Also, some messages may embed other messages, such as certificates embedding votes. Our cryptographic constraint takes a Dolev-Yao-like approach [Dolev and Yao 1983]: byzantine participants may only fill their own IDs in the sender field, and they may only embed already existing messages; however, we give them access to every existing message in the network, regardless of whether they are intended recipients or not.
In addition to cryptographic validity, we enforce a semantic validity rule. For each kind of message, we define a decidable property on its content that must be satisfied. For example, a cache certificate must embed a quorum of votes supporting that cache. Since these properties are decidable, the non-faulty processes simply call the decision procedure and discard the message if the test fails. This allows us to ignore messages that are not semantically valid, which simplifies the proof. See Appendix C for details.
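For instance, the quorum check on a cache certificate can be sketched as a decision procedure like the following, assuming n = 3f + 1 processes and quorums of size 2f + 1. The record shapes are illustrative, not the formal message space of Fig. 8.

```python
# Hedged sketch of a decidable semantic-validity check: a certificate
# for a cache is valid only if it embeds votes for that exact cache
# from a quorum of distinct senders.

def quorum(n):
    f = (n - 1) // 3
    return 2 * f + 1

def semantically_valid(cert, n):
    votes = cert["votes"]
    senders = {v["sender"] for v in votes}
    return (all(v["cache"] == cert["cache"] for v in votes)
            and len(senders) >= quorum(n))

n = 4  # f = 1, so a quorum is 3
votes = [{"sender": i, "cache": "c1"} for i in range(3)]
assert semantically_valid({"cache": "c1", "votes": votes}, n)
# A duplicated sender does not count twice toward the quorum.
dup = [{"sender": 0, "cache": "c1"}] * 3
assert not semantically_valid({"cache": "c1", "votes": dup}, n)
```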
State Space. Fig. 9 shows a simplified view of the internal state of non-faulty processes. Although proposers and voters are logically separate, they are implemented in the same process. The field most relevant to liveness is the current-round field, which dictates which round the process currently participates in. The voted-round field records the highest round in which the process has cast a commit vote. It corresponds to the highest-voted-round variable in the original description. The remaining-time field is the counter of the timer object, as discussed previously. The other fields are progress indicators for leaders and voters, and buffers for received messages. Although the buffers are shown as lists here, they are implemented as finite maps from process IDs to messages, and we keep only one message per ID.
Algorithm 5 summarizes the unpipelined Jolteon protocol. In the first phase, the leader broadcasts a request containing a QC or TC of the previous round, along with a client method of its own choice. This corresponds to a simultaneous pull and invoke in ADO. In the second phase the leader rebroadcasts the votes received, and the voters store the method. This corresponds to a push in ADO. When a process receives a timeout signal from the timer, it broadcasts a timeout message and no longer produces votes for the current round. The timeout message contains the current round number. A quorum of timeouts, each of round at least the current round (they do not need to be of the same round), is used to build a TC. Any non-faulty process that receives a QC or TC should forward it to the next leader. This ensures the leader will also enter the new round within Δ. The pacemaker described in Algorithm 5 corresponds to part (c) of Fig. 1, which is sufficient for our refinement proof. We also implemented a version with the pacemaker improved to part (d) of Fig. 1. See Appendix C for details.
The timer is reset when and only when the process increases its current round. The process enters a round r greater than its current round in one of the following situations: (1) a QC or TC of round r − 1 is received; (2) a valid request of round r is received; (3) a message is received that embeds a certificate of round at least r.
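The first two of the listed situations can be sketched as follows; the message shapes and names are illustrative assumptions, not the paper's definitions:

```python
# Hedged sketch of the round-entry rule: a process enters a round r
# higher than its current round when a suitable message arrives, and
# resets its timer exactly when it does so.

def try_enter_round(cur_round, msg):
    """Return the new round if msg lets the process advance, else None."""
    kind, r = msg["kind"], msg["round"]
    if kind in ("qc", "tc") and r + 1 > cur_round:
        return r + 1        # a QC or TC of round r - 1 enters round r
    if kind == "request" and r > cur_round:
        return r            # a valid request of round r enters round r
    return None

assert try_enter_round(0, {"kind": "tc", "round": 0}) == 1
assert try_enter_round(2, {"kind": "qc", "round": 0}) is None  # stale certificate
assert try_enter_round(1, {"kind": "request", "round": 3}) == 3
```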

Proving Safety and Liveness of Jolteon
We proved both the safety and liveness of Jolteon by constructing a refinement between Jolteon and LiDO. The proof was done in three steps: (1) We construct a refinement between Jolteon and ADO (Definition 3.2), which derives the safety of Jolteon from the safety of ADO; (2) For each network state, we define its abstract pacemaker state, which consists of the round number and the sub-round counter, and prove that each network step either does not change these values or changes them in accordance with one action of the abstract pacemaker; this proves that Jolteon refines LiDO; (3) We prove that all live traces of Jolteon (Definition 4.3) refine live traces of LiDO.
In Appendix C, we present the full details of the proof. Here we present the overall structure and discuss some of its interesting details.
Layered Safety Refinement. The refinement mapping itself is straightforward to define. We map a proposer broadcasting its first-phase request to calling pull at the ADO level. Since the ECache and MCache are created in a single phase, we map building the first-phase certificate to an atomic sequence of creating the ECache, calling invoke, and creating the MCache. Broadcasting the second-phase request and building its certificate are mapped to calling push and building the CCache. If a proposer enters the next round without collecting enough votes for its request, we map it to creating a TCache. The hard part is to show that the image of every valid network trace is a valid ADO trace. It is possible to prove this theorem in a single shot. However, the proof would be quite complex and involve dozens of mutually dependent invariants. Instead, we introduced two intermediate layers called Server and Voting (Fig. 4), which allowed us to reduce proof complexity by proving some invariants on a simpler, more abstract layer. Each lower layer is a transition system that captures more information about the network state but is also more deeply tied to implementation details.
The informal idea of the safety proof is as follows. Each cache and certificate is backed by a quorum of votes or timeouts. For every pair of a QC of round r and a TC of round r′ ≥ r, at least one non-faulty voter has voted for both. The QC must have been produced before the TC. Hence, the highest QC embedded in the TC must be of round at least r. Since the parent of any request of round r′ + 1 must come from either a QC or a TC of round r′, we see that the new leader must observe all committed methods. We now decompose the above argument, so that we only deal with one key invariant at each layer.
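The quorum-overlap step of this argument rests on simple arithmetic: with n = 3f + 1 processes and quorums of size 2f + 1, any two quorums intersect in at least f + 1 processes, hence in at least one non-faulty process. A quick check:

```python
# With quorums of size q out of n processes, two quorums overlap in at
# least 2q - n members. For n = 3f + 1 and q = 2f + 1 this is f + 1,
# which exceeds the number f of byzantine processes.

def min_overlap(n, q):
    return 2 * q - n   # minimum size of the intersection of two quorums

for f in range(1, 50):
    n, q = 3 * f + 1, 2 * f + 1
    assert min_overlap(n, q) == f + 1   # > f, so one honest process overlaps
```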
At the Server layer, the events we capture are proposers building certificate and cache messages, and the pacemaker building TC messages. The proposers, as well as the pacemaker, are modeled as threads running on a shared-memory system. Each thread can observe all existing messages, and may atomically create a single new message. The invariant we enforce at this layer is that for every pair of a QC of round r and a TC of round r′ ≥ r, the TC must embed a QC of round at least r. At the Voting layer, we additionally capture voters sending vote and timeout messages. Again, the voters are threads on a shared-memory system and can observe all existing messages. We enforce that non-faulty voters cannot make conflicting votes. This means they cannot cast two different votes in a single round, they cannot vote in round r after sending a timeout of round r′ ≥ r, etc. It is then easy to prove that the Voting layer refines the Server layer, through the quorum-overlap argument.
The Network layer implements the proposer, voter, and timer threads in our message-passing model. The messages must now be explicitly delivered to each process. Each voter maintains its own bookkeeping and decides whether to produce a vote upon receiving a request. To show that the Network layer refines the Voting layer, we prove that whenever a voter decides to produce a vote, the relevant safety invariant is respected.
A Subtle Safety Issue. Although the proof outlined above seems intuitive, there are many subtle details. Here we present one example. Suppose that the leader of round r enters round r by receiving a quorum of timeouts of round r − 1. According to Algorithm 5, it should find the highest QC embedded within the timeouts. It is possible that a QC of round r + 1 has already been created. If so, a byzantine process could include it in a timeout of round r. In this case, when the request succeeds, the leader would have to set its parent round to r + 1, which violates the ADO safety rules. This situation cannot actually happen: if a QC of round r + 1 exists, then a quorum of voters have already entered round r + 1, and so will not vote on the request of round r. However, this argument is not easy to formalize using invariants. Instead, we adopt a much simpler solution: we make non-faulty processes reject any timeout whose embedded QC is of a higher round than the timeout itself. This ensures that in every valid TC of round r, the highest embedded QC is of round at most r, which eliminates the difficult case described above.

Liveness Refinement. Fig. 10 presents an overview of our layered liveness proof. We first proved that the pacemaker mechanism satisfies the protocol-independent requirements (Definition 3.3). Then we decomposed the time allocated to each round into three steps. In the first step, the leader receives a QC or TC from the previous round and enters the new round. Then the leader completes the two phases of a round. Each phase is further decomposed into two stages: the voters receive the request, and the leader receives the votes.
More specifically, we first define, for each network state, the corresponding abstract pacemaker state: its round number and sub-round counter.
Definition 4.4. Let H be the set of non-faulty processes. For each valid network state, define:

Pipelining works by merging the Commit phase of each round with the Invoke phase of the next round. While pipelining improves latency when there are no byzantine faults [Yin et al. 2019], the fact that committing a method requires the cooperation of two consecutive leaders weakens the liveness guarantee of the protocol. This issue has been studied in Giridharan et al. [2023]. Nevertheless, in the 3f + 1 rotating-leader setting, one can show there must be at least two consecutive non-faulty leaders by a counting argument. If every non-faulty leader were sandwiched by byzantine leaders, then the proportion of byzantine processes would be at least 1/2 instead of 1/3.
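The counting argument can be checked exhaustively for small f; the helper below is an illustrative sketch, not part of the Coq development:

```python
from itertools import combinations

def consecutive_honest_exists(n, byz):
    # True if some cyclically adjacent pair of leader slots is honest.
    b = set(byz)
    return any(i not in b and (i + 1) % n not in b for i in range(n))

# Exhaustive check for f = 1..3: every placement of f byzantine leaders
# in a rotation of n = 3f + 1 leaves two consecutive non-faulty leaders.
for f in range(1, 4):
    n = 3 * f + 1
    assert all(consecutive_honest_exists(n, byz)
               for byz in combinations(range(n), f))
```

Blocking every adjacent pair cyclically would require at least n/2 byzantine slots, but f = (n − 1)/3 < n/2, which is the counting argument in the text.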
Verifying pipelined protocols is more challenging than verifying unpipelined ones, because the liveness of pipelining is tied to the round-change mechanism. Proving the liveness of each single round is not enough; we also have to analyze the cooperation of consecutive leaders and potential byzantine interference.
We have completed a safety and liveness proof of pipelined Jolteon. This shows our approach can be applied to systems with non-trivial optimizations. The details of our implementation and proof are in Appendix D. Here we present the modifications to the liveness proof.
We observe that the liveness of pipelined Jolteon essentially consists of two parts. First, each non-faulty leader can create an MCache on its own. Second, under suitable conditions, two consecutive non-faulty leaders can cooperate to commit the MCache. The "suitable conditions" of the second part are a bit tricky. The first leader initiates pipelining by sending its QC to the second leader. On the other hand, the pacemaker may also send a TC to the second leader. To make pipelining successful, the second leader must receive the QC before any TC. We implement this by requiring that the first leader send its QC soon enough: when it sends out the QC, the sub-round counter must be at least 1. This implies that no TC will be created within Δ, so the next leader must build its own request using the received QC. Our liveness theorem is as follows: Theorem 5.1. In every infinite segmented trace of pipelined Jolteon, let r be the round number at σ_0; then for every r′ > r such that the leaders of rounds r′ and r′ + 1 are both non-faulty, eventually an MCache and a CCache of round r′ are created.
Proof Effort. The safety proof effort remains almost the same. The liveness proof grew slightly more complex and required around 2,000 lines. To show that our Coq specification is realistic and faithful to runnable code, we extracted the network-layer specification of unpipelined Jolteon into executable OCaml code. The code specifies the messages to be exchanged among different nodes and abstractions for sending and receiving messages, but lacks implementations of the network primitives and the timer. We manually glued the network abstraction to OCaml libraries that realize TCP/UDP-based communication through a shim layer and added a timer that triggers timeout events when the round does not advance within a threshold. Still, the main logic comes from the unmodified extracted code. We evaluated the code on a research cloud in a four-node setting. Fig. 11 shows a series of latency measurements to increment the round, either by committing a method or by timing out. Without any failed or byzantine nodes, the system exhibits an average latency of 2.56 ms to commit a request. With a single failed or byzantine node that hinders progress, it takes an average of 4.45 ms to advance the round (the timeout is set to 10 ms). The system is not optimized for performance and does not include pipelining, but the experiment shows that our specification is realistic, the code maintains liveness under failure, and the execution exhibits performance comparable (i.e., 1 ms overhead in the steady state) to the verified Velisarios PBFT implementation [Rahli et al. 2018] and the unverified BFT-SMaRt implementation [Bessani et al. 2014].

RELATED WORK
Theoretical Solutions to Byzantine Consensus Liveness. Maintaining the liveness of byzantine consensus protocols has been traditionally considered a difficult problem. The original PBFT thesis [Castro 2001] did not give a formal liveness proof, although it had a semi-formal safety proof. The problem is that byzantine participants may attempt to force an early round-change, and honest participants need to correctly deal with the messages they send.
HotStuff [Yin et al. 2019] first proposed to use an independent component called the pacemaker to control round-change, so that each round is allocated sufficient time to commit methods. However, the pacemaker of HotStuff is relatively unusual. The participants may enter new rounds without observing a QC or TC of previous rounds. Therefore, HotStuff had to use exponential back-off to ensure that, eventually, the participants would enter the same round. Its liveness dynamics are difficult to analyze. Jolteon [Gelashvili et al. 2022] was then proposed as a variant of HotStuff that reverts to a pacemaker with all-to-all broadcast of timeout messages. The Cogsworth pacemaker [Naor et al. 2021] was proposed as a way to avoid the all-to-all broadcast needed in Jolteon. It has been incorporated into a new version of HotStuff [Malkhi and Nayak 2023]. While our work has only inspected the pacemaker of Jolteon, we expect that most of the pacemaker designs in the literature can be captured and analyzed by our approach. Bravo et al. [2022] proposed a theory of synchronizers, which are objects that control the round-change of each process but are completely independent of other parts of the protocol. They showed that it can be applied to a number of different protocol designs. However, synchronizers are not band-aids that magically repair broken protocols. Applying a synchronizer to a protocol requires changes to the protocol itself, and indeed a large part of Bravo et al. [2022] is devoted to showing that the modified protocol still satisfies safety and liveness. This shows that synchronizers do not replace the need for a formal framework for safety and liveness proofs. We also observe that it is unclear how to apply synchronizers to pipelined protocols, as pipelining relies on a fast path for round change, which synchronizers currently do not provide.
Verifying Safety and Liveness of Consensus Protocols. A large number of frameworks for verifying the correctness of consensus protocols have been proposed in the literature. Table 2 gives a comparison between our work and existing approaches. The table shows a clear pattern: verifying safety is relatively easy, but verifying liveness is a lot harder. Especially for byzantine consensus protocols, all existing liveness results only work for fully asynchronous or synchronous protocols.
A number of projects have aimed at verifying the safety properties of byzantine consensus protocols similar to HotStuff [Yin et al. 2019]. These include Velisarios [Rahli et al. 2018], Carr et al. [2022], and QTree [Cirisci et al. 2023]. In particular, the Velisarios proof uses a logic-of-events approach, which constructs a causal ordering of events and proves safety by induction on this ordering; our safety refinement proof bears similarity to this approach. However, recording only the causal ordering does not provide enough information to establish liveness. For partially synchronous protocols, one also needs temporal ordering, which our segmented-trace formalism addresses. Carr et al. [2022] suggest proving a weak version of liveness called plausible liveness, which essentially means that any valid execution can be extended to commit some data. This notion is inadequate in an adversarial environment: the byzantine adversary may actively delay a successful commit. Another issue is that the protocol may selectively ignore certain requests. Our notion of liveness guarantees that every proposer may always commit some method of its own choice.
IronFleet [Hawblitzel et al. 2015] and PSync [Drăgoi et al. 2016] are the only results we are aware of that cover liveness and can be connected to executable code. Both works only cover benign consensus. PSync uses the Heard-Of model, and the verified code is coupled to a pacemaker. The pacemaker component is not mechanically verified. IronFleet explicitly models timers, heartbeats, and other factors. The model is very comprehensive, but the accompanying proofs are equally verbose. Our methodology results in proofs with a more transparent structure and better reusability. Padon et al. [2017] proposed a liveness-to-safety reduction approach that allows verifying the liveness of protocols in first-order logic. It has been extended to byzantine protocols in Berkovits et al. [2019] and Losa and Dodds [2020], but has so far not been applied to partially synchronous protocols. Our work has shown that it is easy both to capture network dynamics using safety properties and to decompose SMR liveness into safety requirements on the network. However, automating our proofs in model checkers is future work.
AdoB [Honoré et al. 2024] is a recent variant of ADO that supports reasoning about benign and byzantine faults in a unified way. The main difference between AdoB and LiDO is that AdoB is an atomic model, whereas LiDO defines a concurrent but linearizable object. This has significant implications for liveness reasoning. Refinement proofs for AdoB linearize each valid network trace into a valid atomic trace of AdoB. In doing so, they reorder network events and eliminate important temporal information. For example, even if the trace τ_1 is a prefix of τ_2, there is no general relation between their linearized traces τ′_1 and τ′_2. Consequently, although AdoB claims to have a liveness proof, it does not support liveness refinement the way our LiDO model does: live traces of the network model cannot be directly correlated to live traces of AdoB.
Consensus Beyond Partial Synchrony. In this work, we have only considered mechanizing liveness proofs of partially synchronous protocols with a fixed set of participants. In practice, public blockchains often demand byzantine consensus algorithms supporting dynamic or open membership. There are a number of works proposing protocol designs that work under this new setting [Buterin et al. 2020; D'Amato et al. 2023]. Also, Thomsen and Spitters [2021] have mechanized a liveness proof for Nakamoto-style Proof-of-Stake (PoS) consensus under a synchronous setting. In the future, we plan to extend our theory to cover open-membership protocols.

Fig. 1. Variants of a timeout-based pacemaker, with different liveness properties. Red text shows differences from (a). In variants (c) and (d), timeouts can come from different rounds.

Fig. 2. An Example ADO Cache Tree. An MCache is committed if there exists a path from it to a CCache. Hence the MCaches of rounds 1 and 3 are committed, but the MCache of round 2 is not.
Fig. 5. Definition of ADO Cache Nodes and Node Parents.
Operation of Jolteon. Algorithm 5 is a summary of our implementation of Jolteon. Each round has two phases, which we call Invoke and Commit. During the Invoke phase, the leader broadcasts a request that contains a QC or TC of the previous round, along with a client method of its own choice. Proc. ACM Program. Lang., Vol. 8, No. PLDI, Article 193. Publication date: June 2024.

Table 2. Comparison between consensus verification projects. *: The liveness proof does not cover partially synchronous protocols.