NeoBFT: Accelerating Byzantine Fault Tolerance Using Authenticated In-Network Ordering

Mission-critical systems deployed in data centers today are facing more sophisticated failures. Byzantine fault-tolerant (BFT) protocols are capable of masking these types of failures, but are rarely deployed due to their performance cost and complexity. In this work, we propose a new approach to designing high-performance BFT protocols in data centers. By re-examining the ordering responsibility between the network and the BFT protocol, we advocate a new abstraction offered by the data center network infrastructure. Concretely, we design a new authenticated ordered multicast primitive (aom) that provides transferable authentication and non-equivocation guarantees. Feasibility of the design is demonstrated by two hardware implementations of aom, one using HMAC and the other using public-key cryptography for authentication, on new-generation programmable switches. We then co-design a new BFT protocol, NeoBFT, that leverages the guarantees of aom to eliminate cross-replica coordination and authentication in the common case. Evaluation results show that NeoBFT outperforms state-of-the-art protocols on both latency and throughput metrics by a wide margin, demonstrating the benefit of our new network ordering abstraction for BFT systems.


INTRODUCTION
Online services today are commonly deployed in large data centers and rely on fault-tolerance protocols to provide high availability in the presence of failures. An important class of fault-tolerance protocols is state machine replication (SMR). SMR protocols [32,35,40,44,54] have long been deployed in production systems [15,17,21,29] to ensure a set of distributed nodes behaves like a single, always available state machine, despite failures of individual machines. These protocols, however, can only tolerate node crash failures. In reality, systems running in data centers are facing more sophisticated failures. This is particularly relevant today with the growing adoption of permissioned blockchain systems [2,46,47] in data centers for applications such as trading [3,51]. Major cloud providers have also introduced infrastructure support for blockchain-based platforms [5,6,26], highlighting their increasing demand. These systems require tolerance to adversarial nodes and attacks while maintaining low latency and high transaction throughput. Recent work [46] has demonstrated that fault-tolerance protocols are becoming their main performance bottleneck.
Numerous Byzantine fault-tolerant (BFT) protocols [16,20,34,43,55,58,60] have been proposed to handle arbitrary node failures. Their strong failure models, however, come with significant performance implications. BFT protocols typically incur rounds of replica communication coupled with expensive cryptographic operations, resulting in low system throughput and high request latency. To obtain higher throughput, many BFT protocols resort to heavy request batching, which leads to long end-to-end decision latency, often in the range of tens of milliseconds. Unfortunately, such latency overheads are prohibitive for modern data center applications with strict service-level objectives (SLOs). Speculative BFT protocols, such as Zyzzyva [34], offer improved commitment latency. However, even a single faulty replica would negate the performance benefit of these latency-optimized protocols.
In this paper, we introduce a new approach to building high-performance BFT protocols in data centers. We observe that traditional BFT protocols are designed with minimal assumptions about the underlying network, assuming only best-effort message delivery. As a result of this weak network model, application-level protocols are responsible for enforcing all correctness properties, such as total ordering, durability, and authentication. Our key insight is that by strengthening the network model to provide ordered message delivery, the complexity and performance overhead of BFT protocols can be reduced. Prior research [38,39] has demonstrated the promise of in-network sequencing for crash fault-tolerant systems. However, these approaches fall short in the presence of Byzantine failures. For instance, faulty nodes can disseminate conflicting message orders, and the network sequencer may equivocate by assigning different sequence numbers to each replica.
In this work, we propose a new network-level primitive, authenticated ordered multicast (aom), that addresses the above challenges. aom ensures that correct receivers always deliver multicast messages in the same order, even in the presence of Byzantine participants. A key property offered by aom is transferable in-network authentication: receivers can verify that a multicast message is properly delivered by aom, and they can prove the authenticity of the message to other receivers in the system. We additionally propose a mixed failure model [41,52] where possible faulty behaviors of the network infrastructure are considered separately. For deployments that trust the network infrastructure, aom assumes a crash failure model for the network. This slightly weaker model allows aom to provide ordering guarantees with minimal network-level overhead. For systems that require tolerance of Byzantine network devices, aom employs a simple cross-receiver communication round to handle equivocating sequencers.
We demonstrate the feasibility of aom by implementing it on commercially available programmable switches [30]. The switch data plane performs both sequencing and authentication for aom messages. While implementing packet sequencing is relatively simple, generating secure authentication codes poses major challenges given the switch's limited resources and computational constraints. We propose two designs of in-switch message authentication, each with its own set of trade-offs between switch resource utilization, performance, and scalability. The first variant implements SipHash-based [4] message authentication code (HMAC) vectors directly on the switch ASICs. The second variant generates signatures using public-key cryptography. Given the hardware constraints, a direct implementation of cryptographic algorithms such as RSA [49] and ECDSA [31] remains infeasible on these switches. To overcome this limitation, we introduce a novel heterogeneous switch architecture that couples FPGA-based cryptographic coprocessors with the switch pipelines. This design enables efficient in-network processing and signing of aom messages, scales to larger aom groups, and minimizes the hardware resource requirements of the switch data plane.
Leveraging the strong properties of aom, we co-design a new BFT protocol, NeoBFT. In the common case, NeoBFT replicas rely on the ordering guarantee of aom to commit client requests in a single round trip, eliminating all cross-replica communication and authentication. Furthermore, even in the presence of (up to f) faulty replicas, NeoBFT stays in this fast path protocol while meeting the theoretical minimum replication factor (3f + 1). In the event of network failures, we design efficient protocols to handle message drops and faulty switch sequencers while preserving the protocol's correctness. By evaluating against state-of-the-art BFT protocols, we show that NeoBFT improves protocol throughput by up to 4.1× and end-to-end latency by up to 42×. Additionally, NeoBFT maintains its high performance in the presence of Byzantine participants, scales to 100 replicas, and is robust to network anomalies and sequencer failures.

BACKGROUND
In this section, we give an overview of state-of-the-art BFT protocols. We then review recent proposals that use in-network ordering to accelerate SMR systems. Lastly, we specify the targeted deployment model of our work.

State-of-the-Art BFT Protocols
There has been a long line of work on BFT SMR protocols. We present a summary of the state-of-the-art BFT protocols and their key properties in Table 1. PBFT [16] is the first practical BFT protocol that tolerates up to f Byzantine nodes using 3f + 1 replicas, which has been shown to be the theoretical lower bound [14]. In PBFT, client requests are committed in five message delays. First, the client sends a request to a primary replica, which then sequences and forwards the request to the backup replicas. Next, the backup replicas authenticate the requests and broadcast their acceptance. Once a replica receives a quorum of acceptances, it broadcasts a commit decision. Finally, replicas execute the request and reply to the client after collecting quorum commit decisions. As replicas exchange messages in an all-to-all manner, each replica processes O(n) messages, which results in an authenticator complexity of O(n²).
Zyzzyva [34] employs speculative execution of client requests to reduce communication overhead. The protocol offers two execution paths: a fast path that completes in three message delays when clients receive matching replies from all replicas, and a slow path that requires at least five message delays. The primary replica in Zyzzyva still sends signed messages to all backup replicas (O(n) messages). But with all-to-all communication eliminated, the authenticator complexity is reduced to O(n).
Rather than relying on the clients to collect authenticators, SBFT [27] uses a round-robin message collector among all replicas to eliminate all-to-all communication. As a result, authenticator complexity is similarly reduced to O(n). Additionally, SBFT leverages threshold signatures to reduce message size and to decrease the number of client replies to one per decision.
Several BFT protocols ([16,27,34]) use an expensive view change protocol to handle leader failure. For instance, the standard view change protocol in PBFT requires O(n³) message authenticators, limiting its scalability. HotStuff [58] introduces an additional phase during normal operation to address this issue. This modification reduces the authenticator complexity of the leader failure protocol to O(n), matching that of the normal case protocol. However, it adds an extra one-way network delay to the request commit latency.
BFT with trusted components. To reduce protocol complexity, recent research [20,22,37,55,61] proposes to use trusted components on each replica. These components can be implemented in a Trusted Platform Module (TPM) [53] or run in a trusted hypervisor, and are assumed to always function correctly, even when residing on Byzantine nodes.
A2M-PBFT-EA [20] utilizes an attested append-only memory (A2M) to securely store operations as entries in a log. Each A2M log entry is associated with a monotonically increasing, gap-less sequence number. Once a log entry is appended, it becomes immutable and its content can be attested by any node in the system.
MinBFT [55] introduces a message-based trusted primitive called Unique Sequential Identifier Generator (USIG). USIG generates a unique identifier for each input message that is monotonic, sequential, and verifiable. By authenticating USIG identifiers, MinBFT replicas can validate that all other replicas have received the same messages in the same order. This property enables MinBFT to reduce the message delay to four. Unfortunately, MinBFT's authenticator complexity remains at O(n²).

In-Network Ordering for CFT Protocols
Another recent line of work [38,39,45] proposes a new approach to designing crash fault-tolerant (CFT) SMR protocols. These systems move the responsibility of request ordering to the data center network. By doing so, application-level protocols only need to ensure durability of client operations. This network co-design approach improves SMR protocol performance by reducing the coordination overhead among servers needed to commit an operation. For instance, NOPaxos [39] dedicates a programmable switch in the network as a sequencer, which stamps a sequence number onto each request. This ensures that all replicas execute requests in the same sequence number order. However, these solutions only target CFT protocols and are unable to handle Byzantine faults, such as when a Byzantine node impersonates the sequencer and broadcasts conflicting message orders.

Deployment Model
Our work targets the permissioned [2,46] BFT setting, where access to the system is controlled and there is no mutual trust among the participants. We further target blockchain applications with strict latency requirements, such as trading [3,51]. For performance considerations, these systems are commonly deployed within a single data center. Our solution can be easily extended to geo-distributed settings, but in this work we focus on the single data center use case.

AUTHENTICATED IN-NETWORK ORDERING
The core of a BFT SMR protocol is to establish a consistent order of requests even in the presence of failures. Traditionally, this task is accomplished by explicit communication among the replicas, typically coordinated by a leader. In this work, we propose a new approach to improving the efficiency of BFT protocols. Our approach shifts the responsibility of request ordering to the network infrastructure.

The Case for an Authenticated Ordering Service in Data Center Networks
To guarantee linearizability [28], BFT SMR protocols require that all non-faulty replicas execute client requests in the same order. However, due to the best-effort network assumptions, an application-level protocol is fully responsible for establishing a total order of requests among the replicas. For example, in PBFT [16], the primary replica assigns an order to client requests before broadcasting to backup replicas. All replicas then use two rounds of communication to agree on this ordering while tolerating faulty participants. As discussed in §2, adding trusted components to each replica does not alleviate the coordination and authentication overhead in BFT protocols. Replicas still require remote attestations to verify the received messages.
What if the underlying network can provide stronger guarantees? Prior work [38,39,45] has already demonstrated that in-network ordering, realized through network programmability [13,30], can offer compelling performance benefits to crash fault-tolerant SMR protocols. In this work, we argue that BFT protocols can similarly benefit from shifting the ordering responsibility to the network. By offloading this task to the network, BFT replicas can avoid explicit communication to establish an execution order, thereby reducing cross-replica coordination and authentication overhead. This network ordering approach improves both protocol throughput and latency, as less work is performed on each replica, and fewer message delays are needed to commit a request.
Why authenticated ordering in the network? In previous network ordering systems, the responsibility of ordering requests is entirely delegated to the network primitive, such as the Ordered Unreliable Multicast in NOPaxos [39] and the multi-sequenced groupcast in Eris [38]. In non-Byzantine contexts, this network-level ordering is the only request order observed by any replica. However, in a BFT deployment, a faulty node can easily impersonate the network primitive and assign a conflicting message order, violating the ordering guarantee of the network layer. To prevent this, we augment the network primitive to provide authentication: non-faulty replicas can independently verify that the received message order is indeed established by the network and not by any faulty node. In §4, we explain how such authentication can be efficiently implemented using commodity switch hardware.
Hybrid fault model and Byzantine network. If the network itself exhibits Byzantine faults, it can equivocate by assigning different message orders to different replicas, thereby violating the ordering guarantee. In this work, we argue for a dual fault model. The model always assumes a Byzantine failure model for end-hosts. The network infrastructure, on the other hand, can either be crash-faulty or Byzantine-faulty. Our argument is inspired by prior work proposing a hybrid fault model [52] and work [41] that separates machine faults from network faults. Our approach provides deployment flexibility: users can choose either a hybrid failure model or the traditional Byzantine model, with an explicit trade-off between fault tolerance and performance. For deployments that trust the network to only exhibit crash and omission faults, i.e., a hybrid fault model, our solution offers the optimal performance; if the network infrastructure can behave arbitrarily, our solution can tolerate Byzantine faults in the network, albeit taking a small performance penalty.
We contend that a hybrid fault model, which assumes the network is crash-faulty, is a practical choice for many systems deployed in data centers. Networking hardware presents a smaller attack surface and is less vulnerable to bugs compared to software-based components. Switches are single-application ASICs without sophisticated system software, and formal verification of their hardware designs is a common practice. Furthermore, systems deployed in data centers inherently place some level of trust in the hardware infrastructure. Data center operators also have a strong economic incentive to maintain the trust of their customers by providing reliable services. We, however, admit that this model is weaker than those assumed by traditional BFT protocols. Under the hybrid fault model, our system no longer guarantees safety or liveness if the network becomes adversarial.
Our fault model resembles existing deployment options in the public cloud, where only deployments that do not trust the cloud infrastructure run their virtual machines on instances with a Trusted Execution Environment (TEE) such as Intel SGX. Most use cases, however, place trust in cloud hardware and hypervisors. In return, they attain higher performance compared to their TEE counterparts.

Authenticated Ordered Multicast
So far, we have argued for an authenticated ordering service in the network for BFT protocols. To that end, we propose a new Authenticated Ordered Multicast (aom) primitive as a concrete instance of such a model. Similar to other multicast primitives like IP multicast, an aom deployment consists of one or multiple aom groups, each identified by a unique group address. aom receivers can join and leave an aom group by contacting a configuration service. A sender sends an aom message to an aom group address, which the network is responsible for routing to all group receivers. Notably, senders do not have knowledge of the identity or the address of individual receivers. Instead, they only specify the group address as the destination.
Unlike traditional best-effort IP multicast, aom provides a set of stronger guarantees, which we formally define here:
• Asynchrony. There is no bound on the delivery latency of aom messages.
• Unreliability. There is no guarantee that an aom message will be received by any receiver in the destination group.
One of the key distinguishing properties of aom is the ability for receivers to verify the authenticity of aom messages independently. In this context, authenticity refers not to the identity of the sender, which still requires end-to-end cryptography. Instead, it assures that a message has been correctly processed by the aom primitive, and that its ordering has not been tampered with by other participants in the system. The authentication capability is also transferable: an aom message can be relayed to any other receiver in the group, who can independently verify its authenticity.
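To make the primitive concrete, the following sketch shows what a receiver-facing delivery interface could look like in Rust (the language of our library); the type and method names are illustrative, not the exact library API:

```rust
/// A multicast message as delivered by aom, carrying the order
/// established in the network.
pub struct OrderedMessage {
    pub group_id: u32,
    pub sequence: u64,
    pub payload: Vec<u8>,
    /// Transferable authenticator produced by the sequencer switch.
    pub authenticator: Vec<u8>,
}

/// A delivery event: either an authenticated in-order message, or a
/// notification that the message with this sequence number was lost.
pub enum Delivery {
    Message(OrderedMessage),
    DropNotification { group_id: u32, sequence: u64 },
}

pub trait AomReceiver {
    /// Blocks until the next in-order delivery event for the group.
    fn deliver(&mut self) -> Delivery;

    /// Verifies that a (possibly relayed) message carries a valid aom
    /// authenticator: transferable authentication.
    fn verify(&self, msg: &OrderedMessage) -> bool;
}
```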

AOM DESIGN
In this section, we present our design of the proposed aom network primitive on programmable switches. The overall system architecture is shown in Figure 1.

Design Overview
Our aom primitive design consists of three major components: a network-wide configuration service, a programmable network data plane, and an application-level library running on aom senders and receivers. Analogous to IP multicast, receivers create and join an aom group by contacting the configuration service via secure TLS channels. The configuration service then designates one programmable switch in the network as the sequencer for the group and directs it to broadcast routing advertisements for the group address using a protocol such as BGP. Upon successful propagation of the advertisement, aom messages destined for the group address will be forwarded to the designated sequencer switch.
The sender-side library generates a custom packet header that follows the UDP header. This custom header includes the group ID, a sequence number, an epoch number, a message digest, and an authenticator. The digest is generated using a collision-resistant hash function [48]. The sequencer switch is responsible for filling in all fields in the custom header, excluding the group ID and the message digest.
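As a concrete illustration, the header could be laid out as the following Rust struct; the field widths are our assumption for exposition and do not reflect the exact wire format:

```rust
/// Sketch of the aom header carried after the UDP header.
#[repr(C, packed)]
pub struct AomHeader {
    group_id: u32,           // set by the sender-side library
    sequence: u64,           // stamped by the sequencer switch
    epoch: u32,              // stamped by the sequencer switch
    digest: [u8; 32],        // collision-resistant hash of the payload, set by the sender
    authenticator: [u8; 64], // HMAC vector or public-key signature, filled in by the switch
}
```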
The switch features a sequencing module (§4.2), which stamps sequence numbers onto aom packets. The pipeline then feeds the stamped sequence number, concatenated with the message digest, into an authentication module. This module generates an authenticator, which can be a vector of HMACs (§4.3) or a single public-key signature (§4.4). The switch then incorporates the generated authenticator into the header and multicasts the packet to all group receivers.
The receiver-side library verifies the authenticator and delivers aom messages in sequence number order. For any gap in the number sequence, the receiver delivers a drop-notification. For deployments that operate in a Byzantine-faulty network, the receiver-side library additionally exchanges confirm messages with other receivers within the group to tolerate sequencer equivocation (§4.2).

Message Ordering and Failure Handling
To establish a consistent ordering of aom messages, we leverage programmable switches to stamp monotonically increasing sequence numbers onto each aom packet. The sequencer switch employs a register array to maintain a counter for each aom group. A separate match table maps aom group IDs to indices into this array. During aom packet processing, the switch locates the counter register using the group ID within the header, increments the counter, and inserts the counter value into the header. After the switch generates an authenticator, it uses its replication engine to multicast the stamped aom message to all aom receivers within the group. As detailed in §4.1, receivers deliver authenticated aom messages in sequence number order. If a gap in the number sequence is observed, the receiver delivers a drop-notification.
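The sequencing logic itself is simple; the following Rust model sketches the per-group counter behavior that the data plane implements with a match table and a register array (assuming one logical counter per group, initialized to zero):

```rust
use std::collections::HashMap;

/// Software model of the switch sequencing module.
struct Sequencer {
    counters: HashMap<u32, u64>, // group ID -> last assigned sequence number
}

impl Sequencer {
    /// Increments the group's counter and returns the value to stamp
    /// into the aom header; sequence numbers are monotonic per group.
    fn stamp(&mut self, group_id: u32) -> u64 {
        let counter = self.counters.entry(group_id).or_insert(0);
        *counter += 1;
        *counter
    }
}
```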
Tolerating a Byzantine-faulty network. If the network infrastructure is trusted to be non-Byzantine (§3.1), a receiver can directly deliver authenticated aom messages in sequence number order. This delivery rule conforms to our ordering property, as any two receivers are guaranteed to receive identical messages for each sequence number. Combined with the transferable authentication property, a single aom message suffices as a publicly verifiable ordering certificate for itself, which we exploit in our NeoBFT protocol (§5).
However, in a Byzantine-faulty network, the sequencer may equivocate by sending different message orderings to different receivers.
To tolerate network-level equivocation, upon receiving an aom message, a receiver i broadcasts a signed ⟨confirm, s, h⟩σi to the group, where s is the message sequence number and h is the hash of the message. i ignores subsequent aom messages with the same sequence number, and only delivers an aom message after it collects enough matching confirms (at least 2f + 1, where f is the maximum number of faulty receivers). This strengthened delivery rule ensures ordering, as quorum intersection guarantees that no two non-faulty replicas can deliver distinct aom messages for the same sequence number. The entire message set, including the aom message and the matching confirms, is delivered to the application and serves as an ordering certificate.
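A minimal sketch of this strengthened delivery rule, with signature verification elided and illustrative names, follows:

```rust
use std::collections::HashMap;

/// Collects signed confirms and releases a message for delivery only
/// once 2f + 1 receivers vouch for the same (sequence, hash) pair.
struct ConfirmCollector {
    f: usize, // maximum number of faulty receivers
    confirms: HashMap<(u64, [u8; 32]), Vec<u32>>, // (s, h) -> confirming receiver IDs
}

impl ConfirmCollector {
    /// Records a confirm (assumed already signature-checked) and
    /// returns true once the quorum of 2f + 1 is reached.
    fn on_confirm(&mut self, s: u64, h: [u8; 32], receiver_id: u32) -> bool {
        let voters = self.confirms.entry((s, h)).or_default();
        if !voters.contains(&receiver_id) {
            voters.push(receiver_id);
        }
        voters.len() >= 2 * self.f + 1
    }
}
```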
Sequencer switch failover. Due to network partitions, faulty sequencer switches, or other network anomalies, aom receivers may fail to receive authenticated aom messages indefinitely. In such cases, receivers can request the configuration service to fail over to a different sequencer for the group. However, the set of delivered messages at each receiver may differ when the new switch takes over. Furthermore, the old sequencer may only suffer a transient fault. To properly handle a switch failover, the application-level protocol is responsible for reaching consensus on the set of messages delivered by the failed sequencer (§5.5). Once an agreement is reached, the receivers ask the configuration service to select a new sequencer switch and exchange the necessary authentication keys. Subsequently, they can start delivering authenticated aom messages from the new sequencer switch and ignore messages from the old one.

HMAC-Based Authentication
Generating secure and transferable authentication tokens in network hardware is more challenging than message sequencing due to switch resource constraints. Our first design uses an HMAC vector as the aom authentication token [16]. Upon joining an aom group, a receiver uses a key exchange protocol [42] to share a secret key with the sequencer switch, facilitated by the configuration service. The switch control plane installs the secret key of each receiver in the data plane. To authenticate an aom message, the switch generates a vector of HMACs, one entry for each receiver. Each code is computed by inputting the concatenated message digest and sequence number (§4.1), and the receiver's secret key, to a keyed cryptographic hash function. The switch then writes the entire HMAC vector into the message header. A receiver authenticates an aom message by comparing a locally computed HMAC to the corresponding entry in the received vector. By including the entire HMAC vector, aom authentication is transferable (§3.2).
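The following Rust sketch captures the HMAC-vector scheme end to end; `keyed_hash` is a stand-in for the keyed cryptographic hash (HalfSipHash in our switch design), and its body here is a non-cryptographic placeholder:

```rust
/// Placeholder keyed hash; the switch uses HalfSipHash instead.
fn keyed_hash(key: &[u8; 16], input: &[u8]) -> u32 {
    key.iter().chain(input.iter())
        .fold(0u32, |h, &b| h.wrapping_mul(31).wrapping_add(b as u32))
}

/// Switch side: one HMAC per receiver over digest || sequence number.
fn make_hmac_vector(keys: &[[u8; 16]], digest: &[u8; 32], seq: u64) -> Vec<u32> {
    let mut input = digest.to_vec();
    input.extend_from_slice(&seq.to_be_bytes());
    keys.iter().map(|k| keyed_hash(k, &input)).collect()
}

/// Receiver i: recompute the local entry and compare it against the
/// vector carried in the header.
fn verify_entry(i: usize, key: &[u8; 16], digest: &[u8; 32], seq: u64, vector: &[u32]) -> bool {
    let mut input = digest.to_vec();
    input.extend_from_slice(&seq.to_be_bytes());
    vector.get(i).map_or(false, |&mac| mac == keyed_hash(key, &input))
}
```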
In-switch HMAC implementation. Implementing an unforgeable HMAC requires access to collision-resistant cryptographic hash functions. Recent advancements, such as HalfSipHash [59] and P4-AES [19], demonstrate the feasibility of implementing high-throughput cryptographic hash functions on switches. In this work, we use HalfSipHash as a building block for our in-switch HMAC design.
A cryptographic hash function, however, only solves half of the equation; implementing HMAC vectors in the switch data plane introduces several new technical challenges. First, the hash function consumes significant switch hardware resources. For instance, the reference HalfSipHash implementation uses all 12 pipeline stages of a Tofino [30] switch. Naively replicating the HMAC calculation to generate HMAC vectors would easily exceed the switch resource constraints. Second, there exist data dependencies between HMAC computation and other switch logic (e.g., sequencing). A sequential combination of the components would result in a dependency chain that surpasses the hardware limit [33]. Lastly, the size of HMAC vectors grows linearly with the aom group size. The switch data plane design needs to scale to handle large aom instances given a fixed set of resources.

Our approach. In Figure 2, we illustrate our switch data plane design for HMAC vector generation. To overcome the challenges of limited resources and data dependency, we dedicate one switch pipeline (pipe 1) solely to the computation of HMAC vectors. After ingress processing, packets requiring HMAC vector generation are forwarded to the loopback ports in the designated HMAC pipeline. Upon completion of the HMAC vector calculation, the HMAC module multicasts the resultant packets to the intended egress pipelines or returns them to the original ingress for further processing. Our pipeline-folding architecture extends the computation capacity beyond the available pipeline stages. Furthermore, our approach decouples HMAC vector computation from the other packet processing logic, leading to a more streamlined and modular design.
The original HalfSipHash design [59] requires 6 pipeline passes to produce one HMAC. Even for a small aom group with 4 receivers, this design uses 24 pipeline passes to generate the entire HMAC vector. To improve the overall vector generation latency, we trade off the number of pipeline passes per HMAC for a higher degree of parallelism. Specifically, our design unrolls the reference HalfSipHash implementation, which extends the number of pipeline passes to 12. In return, we reduce the hardware resources required for one HMAC instance by half. As a result, our design can fit four parallel instances of HalfSipHash in the HMAC module, thereby generating a 4-HMAC vector in 12 pipeline passes.
To scale beyond 4 receivers in an aom group, we leverage the additional loopback ports in the dedicated HMAC pipeline. As our base design can produce 4 HMACs each time, we partition receivers into subgroups of 4. To request HMAC vector generation, the switch multicasts the packet to g loopback ports in the HMAC pipeline, where g is the number of subgroups. The switch then runs g independent instances of the base design, each generating 4 HMACs for a subgroup. The resulting g packets are all sent to the receivers, who assemble the complete HMAC vector. With 16 loopback ports [30], our design can scale up to 64 receivers. For smaller aom groups, the switch load balances between the loopback ports to increase the vector generation capacity.
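The fan-out arithmetic and receiver-side reassembly are straightforward; a sketch follows, with the subgroup size of 4 fixed by our base design:

```rust
/// Number of subgroups g = ceil(n / 4); the 16 loopback ports bound
/// the group at 64 receivers.
fn subgroup_count(n_receivers: usize) -> usize {
    (n_receivers + 3) / 4
}

/// Receiver side: merge the g partial packets (4 HMACs each) back
/// into a single vector indexed by receiver position.
fn assemble_vector(partials: &[(usize, [u32; 4])], n_receivers: usize) -> Vec<u32> {
    let mut vector = vec![0u32; n_receivers];
    for &(subgroup, macs) in partials {
        for (j, &mac) in macs.iter().enumerate() {
            let idx = subgroup * 4 + j;
            if idx < n_receivers {
                vector[idx] = mac;
            }
        }
    }
    vector
}
```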

In-Network Public-Key Cryptography
Our switch HMAC design is optimized for small aom groups. However, as the group size increases, the performance of our HMAC authentication degrades. The switch requires additional pipeline passes to compute the larger HMAC vector, which reduces the effective HMAC vector generation rate. Moreover, each receiver processes more packets for each aom message, as the number of HMACs that can fit in a single header is limited by the Packet Header Vector (PHV). To address this scalability issue, we propose an alternative design that implements public-key signature-based authentication.
Concretely, each sequencer switch generates a private-public key pair. All public keys are stored and distributed by the configuration service. To authenticate an aom message, the switch uses its private key and the concatenated message digest and sequence number (§4.1) as input to a public-key algorithm [31] that generates a digital signature. aom receivers then use the switch public key to verify the sequencer signature in the message. The performance of our signature-based authentication design is group-size agnostic, as the switch generates a single signature for each aom message, regardless of the number of receivers.
In-network cryptography design. Implementing public-key cryptography in a network switch is a daunting task. The RSA [49] public-key algorithm requires modular exponentiation of large prime numbers. Even with aggressive optimizations, calculating an RSA signature still involves unbounded loops of multiplications and modulo operations. The ECDSA [31] algorithm involves similar complexity, and additionally requires random number generation and multiplicative inverses. Unfortunately, current generation programmable switches lack support for these computations, and future switch data planes are unlikely to accommodate them due to strict timing, power, and resource constraints.
To overcome the limitations of existing switches, we propose a new switch architecture that includes a specialized cryptographic coprocessor alongside the primary switching chip. This coprocessor is equipped with a simple processing element, dedicated fast memory, and cryptographic accelerators, and is connected to the switching chip through high-speed network links. To offload cryptographic operations to the coprocessor, the switch constructs remote procedure call (RPC) metadata that specifies the operation type, key identifier, input message, and offsets into the packet for operation outputs. After egress pipeline processing, the switch submits both the RPC metadata and the original packet to the coprocessor for processing. The coprocessor then writes the result into the packet and forwards it back to the switch. Our offloading design is best-effort. The coprocessor implements a tail-drop queue for submitted operations, and the switch does not maintain RPC state locally.
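A sketch of what the RPC metadata could look like follows; the field layout and widths are our assumption for exposition, not the exact format exchanged between the switch and the coprocessor:

```rust
/// Illustrative layout of the crypto-offload RPC metadata.
#[repr(C, packed)]
struct CryptoRpc {
    op_type: u8,        // e.g., 0 = sign (SHA-256 then secp256k1)
    key_id: u8,         // selects the switch private key on the coprocessor
    input_offset: u16,  // byte offset of digest || sequence in the packet
    input_len: u16,     // input length in bytes
    output_offset: u16, // where the coprocessor writes the signature
}
```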
Cryptographic coprocessor implementation. We develop a coprocessor prototype for aom signature signing on a Xilinx Alveo U50 FPGA card [1]. Figure 3 shows the high-level architecture of our hardware design. The card connects to one of the switch ports through a 100Gbps QSFP28 cable. A parser module parses the RPC metadata, and forwards the operation input to a hashing module to calculate a SHA-256 [24] hash. A signing module then generates a signature of the hash using the secp256k1 elliptic curve [31]. Finally, a stream merger module stamps the signature into the packet and sends it back to the QSFP28 port. We developed all the hardware modules, except the Xilinx QSFP28 hard IP, in-house using a combination of RTL and HLS.
Even with a powerful FPGA chip, calculating a secp256k1 signature is still a time-consuming process. We reduce signing latency by exploiting an underlying mathematical property of secp256k1: a significant portion of the curve computation is input-independent. Specifically, we design a pre-compute module that continuously calculates multiples of a generator point of the elliptic curve and stores them in a pre-computed table in fast block RAM. The signer module uses values in this table to speed up scalar point multiplication.
The rate of generating pre-computed table entries can limit the overall coprocessor signing throughput. To address this limitation, we propose a novel hash chaining technique. A packet updater module stamps into each aom packet an additional SHA-256 hash of the preceding packet in the number sequence. A signing ratio controller monitors the stock level of the pre-computed table and instructs the signer module to skip generating signatures for packets once the stock level falls below a threshold. Consequently, while all aom packets contain a SHA-256 hash of the previous packet in the stream, only a subset of them may include a signature. To authenticate signature-less packets, receivers wait until the next signed packet and verify the entire batch by validating the hash chain in reverse order.
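The receiver-side batch check can be sketched as follows; `sha256` and `verify_sig` stand in for the actual hash and secp256k1 verification primitives, which are assumed:

```rust
struct ChainedPacket {
    prev_hash: [u8; 32],        // SHA-256 of the preceding packet's body
    body: Vec<u8>,
    signature: Option<Vec<u8>>, // present only on signed packets
}

/// Verifies a batch that ends with a signed packet: check the
/// anchoring signature, then check every adjacent hash-chain link, so
/// the signed tail's validity transfers to every unsigned packet.
fn verify_batch(
    batch: &[ChainedPacket],
    sha256: impl Fn(&[u8]) -> [u8; 32],
    verify_sig: impl Fn(&[u8], &[u8]) -> bool,
) -> bool {
    let last = match batch.last() {
        Some(p) => p,
        None => return false,
    };
    match &last.signature {
        Some(sig) if verify_sig(&last.body, sig) => {}
        _ => return false, // batch must be anchored by a valid signature
    }
    // prev_hash of packet k must equal the hash of packet k - 1.
    batch.windows(2).all(|w| w[1].prev_hash == sha256(&w[0].body))
}
```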

Which Authentication Variant to Use?
The two authentication variants have distinct sets of trade-offs. The HMAC-based scheme is lighter weight and can be implemented on existing switches without special hardware support. However, it suffers from poor scalability, requires more complex credential setup, and has weaker security guarantees due to the hash function limitation. The scheme is therefore a better option for smaller deployments with stricter performance requirements. The public-key signature variant scales to large receiver groups and is more secure. It, however, requires switches that are not yet commercially available or additional FPGA hardware. Verifying public-key signatures also incurs a higher overhead on the receivers.

THE NEOBFT PROTOCOL
Leveraging the authenticated ordering guarantee provided by aom, we co-design a new BFT protocol, NeoBFT, that commits client operations in a single RTT, even in the presence of Byzantine replicas.

System Model
We assume a Byzantine failure model for both clients and replicas. The fault model of the aom primitive can be either hybrid or Byzantine, as discussed in §3.1. We make standard failure assumptions [16] about the configuration service. The service ensures that no more than f faulty replicas are present in a replication group, and a correct sequencer switch is eventually installed for each group. Note that the eventual correct switch assumption is only for protocol liveness, not safety.
We make standard cryptography assumptions: nodes do not possess enough computational resources to subvert the cryptographic hash functions, message authentication codes, and public-key crypto algorithms we use in the protocol. We also assume a strong adversary model: Byzantine nodes can collude, but cannot delay correct nodes indefinitely.
NeoBFT is a state machine replication [50] protocol. We assume all operations executed by the protocol are deterministic. With no more than ⌊(n − 1)/3⌋ (Byzantine) faulty replicas in the system (where n is the total number of replicas), NeoBFT guarantees linearizability [28] of client operations. Due to the impossibility of asynchronous consensus [25], NeoBFT only ensures liveness during periods of synchrony.

Protocol Overview
NeoBFT relies on the guarantees provided by the aom network primitive to achieve single-RTT commitment in the common case. Specifically, clients multicast requests to NeoBFT replicas using aom. In the absence of network-level anomalies (e.g., message drops and switch failures), all correct replicas deliver aom messages in the exact same order. Crucially, this guarantee implies that replicas require no explicit communication to agree on the order of messages. NeoBFT thus avoids the expensive cross-replica coordination and server signature signing/verification required by other BFT protocols. Moreover, adversaries cannot tamper with the order of messages or their content, as correct replicas can independently verify the authenticity and integrity of each aom message. Once an aom message is delivered, replicas can immediately execute the request and respond to the client, resulting in a single-phase fast path protocol. As discussed in §4.2, when delivering messages, the network primitive provides an ordering certificate. Similar to previous speculative protocols [34,39,45], NeoBFT relies on clients to confirm operation durability. As a result, replicas execute speculatively, before the final commitment. However, since all correct replicas have already established a total order of operations, NeoBFT does not require extra protocols to handle faulty replicas (Zyzzyva [34]) or state divergence due to out-of-order executions (Speculative Paxos [45]). Only in the exceptional case where a speculatively executed operation is later agreed to be skipped (due to the gap commit or the view change protocol) does a NeoBFT replica need to roll back application state.
In the rare case where aom messages are dropped in the network, the aom primitive delivers drop-notifications to non-faulty replicas. To handle drop-notifications, replicas only need to agree on whether to process or to skip the message, not on the order of messages. NeoBFT uses a BFT binary consensus protocol, driven by a leader replica, to reach this agreement. In this protocol, the leader uses a single ordering certificate (received by any replica) to commit the corresponding message. To permanently skip the message, the leader replica collects evidence from a quorum of replicas to form a drop certificate, which non-leader replicas verify before committing the message as a no-op. We use drop certificates to prevent Byzantine replicas from delaying the agreement indefinitely.
A faulty aom sequencer may stop multicasting messages, or deliberately equivocate or drop messages. To ensure progress, replicas request the configuration service to replace the faulty sequencer. Installation of a new sequencer indicates the start of a new epoch. Correctness of the protocol requires replicas to agree on the set of messages processed in the last epoch before entering the new epoch. To that end, each NeoBFT instance goes through a sequence of views; each view is identified by a view number represented as a ⟨epoch-num, leader-num⟩ 2-tuple. When the current leader replica has failed (or is suspected to have failed) or an old epoch has ended, replicas advance the respective field in the view number, and use a view change protocol [16,39,40,45] to reach agreement on the set of messages in the last view.
NeoBFT also includes a protocol to periodically synchronize replica states and finalize speculatively executed requests. Details of the protocol can be found in §B.2.

Normal Operation
We first consider the common case protocol in which aom messages, instead of drop-notifications, are delivered to NeoBFT replicas in a stable epoch. A client c requests execution of an operation op by sending a signed ⟨request, op, request-id⟩σc using the aom primitive, where request-id is a client-generated identifier used to match replica replies. The message is processed by the network primitive (§3.2), and an ordering certificate (oc) for the message is delivered to all replicas. If the client does not receive replies in a timely manner, it uses regular unicast to send the request to all replicas (while it keeps resending the request using aom). Our view change protocol (§5.5) ensures that the request is eventually committed even if the aom sequencer is faulty.
Replica i verifies the oc by authenticating the aom authenticator and checking the 2f + 1 matching confirms (only for a Byzantine-faulty network). It then adds the oc to its log, speculatively executes op, and signs and replies ⟨reply, view-id, i, log-slot-num, log-hash, request-id, result⟩σi to the client, where view-id is the current view number, log-slot-num is the log index the request occupies, log-hash is a hash of the log up to the index, and result is the execution result. We use hash chaining for O(1) hash calculation [45].
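The chained log hash makes the per-reply hash cost constant: the hash over the log up to slot k is derived from the previous slot's hash and the new request's digest. A sketch, with `sha256` as an assumed primitive:

```rust
/// O(1) chained log-hash update: log_hash_k = H(log_hash_{k-1} || digest_k).
fn extend_log_hash(
    prev_log_hash: &[u8; 32],
    request_digest: &[u8; 32],
    sha256: impl Fn(&[u8]) -> [u8; 32],
) -> [u8; 32] {
    let mut input = Vec::with_capacity(64);
    input.extend_from_slice(prev_log_hash);
    input.extend_from_slice(request_digest);
    sha256(&input)
}
```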
Client c waits for 2f + 1 replies from different replicas with valid signatures and matching view-id, log-slot-num, log-hash, and result. It then accepts the result in the reply.

Handling Dropped Messages
When a non-leader replica i receives a drop-notification, it attempts to recover the missing message from the leader. To do so, it sends a ⟨query, view-id, log-slot-num⟩ to the leader. query messages require no signatures since they do not alter the state of a correct replica. If the leader has the corresponding oc, it responds with a ⟨query-reply, view-id, log-slot-num, oc⟩. Replica i verifies the oc and ensures the enclosed aom message is the missing message by checking the internal sequence number. It then resumes normal operation. Because the oc can be independently verified by any replica, query-replys also require no signatures. Note that replica i blocks waiting for the leader's response or a committed no-op before processing subsequent client requests, resending query messages if necessary.
If the leader l itself receives a drop-notification, it broadcasts a ⟨gap-find-message, view-id, log-slot-num⟩σl to all replicas. When replica i receives a gap-find-message, it replies to the leader with either a ⟨gap-recv-message, view-id, log-slot-num, oc⟩ if it has received the ordering certificate, or a ⟨gap-drop-message, view-id, i, log-slot-num⟩σi if it has also received a drop-notification for the message. If a replica replies gap-drop-message to a gap-find-message, it blocks until it receives the gap agreement decision (ignoring query-replys for the message).
Once the leader receives one gap-recv-message or 2f + 1 gap-drop-messages (including from itself), whichever happens first, it uses a binary Byzantine agreement protocol, similar to PBFT [16], to commit the decision. Specifically, the leader broadcasts a ⟨gap-decision, view-id, log-slot-num, decision⟩σl, where decision is either a single gap-recv-message or 2f + 1 gap-drop-messages. If a gap-recv-message is received, the leader first verifies the enclosed oc following the same procedure as above.
When replica i receives a gap-decision, it verifies the enclosed oc if the decision contains a gap-recv-message. If the decision contains 2f + 1 gap-drop-messages, the replica verifies that all 2f + 1 messages are from distinct replicas, and that their log-slot-num matches the one in the gap-decision. It then broadcasts a ⟨gap-prepare, view-id, i, log-slot-num, recv-or-drop⟩σi, where recv-or-drop is a binary value indicating the decision, to all replicas.
Once replica i receives 2f gap-prepares from distinct replicas (possibly including itself) and it has received a validated gap-decision with a matching decision from the leader, it broadcasts a ⟨gap-commit, view-id, log-slot-num, recv-or-drop⟩σi to all replicas. The replica also stores the gap-decision and the 2f gap-prepares in its log for the view change protocol.
When replica i receives 2f + 1 gap-commits from different replicas (possibly including itself), it stores either the oc (if it hasn't done so) or a no-op to the log slot based on the decision, and resumes normal operation if it is blocking on a query-reply or a gap agreement decision. It also stores all 2f + 1 gap-commits in its log. This quorum of gap-commits will serve as a gap certificate for the state synchronization and view change protocols. In the rare case where the replica has already speculatively executed the request and the decision is a drop, it rolls back the application state to right before log-slot-num, and re-executes subsequent requests in the log.
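The leader's trigger for the gap agreement can be summarized in a few lines; the types below are illustrative simplifications of the actual messages, and in the protocol whichever condition is met first produces the decision:

```rust
/// The leader's decision input: a single ordering certificate, or a
/// drop certificate formed from 2f + 1 distinct gap-drop-messages.
enum GapDecision {
    Recv(Vec<u8>),  // the ordering certificate (oc)
    Drop(Vec<u32>), // IDs of the 2f + 1 replicas reporting a drop
}

/// Checks both conditions; a single verified oc suffices to commit
/// the message, while skipping it requires a 2f + 1 drop quorum.
fn leader_decide(oc: Option<Vec<u8>>, drop_reporters: &[u32], f: usize) -> Option<GapDecision> {
    if let Some(cert) = oc {
        return Some(GapDecision::Recv(cert));
    }
    let mut distinct = drop_reporters.to_vec();
    distinct.sort_unstable();
    distinct.dedup();
    (distinct.len() >= 2 * f + 1).then(|| GapDecision::Drop(distinct))
}
```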

View Changes
We use a view change protocol, inspired by PBFT, to handle both leader failures and faulty aom sequencers. The protocol guarantees that all committed operations (including no-ops) will carry over to the new view. View changes can be initiated when a non-leader replica fails to make progress in a gap agreement or state synchronization protocol, or when a replica receives a request message directly from the client (§5.3) but the request is not delivered by aom after a timeout.
For view changes that involve switching epochs, the protocol requires log consistency before entering the new epoch. To that end, we introduce an epoch certificate consisting of 2f + 1 valid epoch-start messages from distinct replicas. An epoch certificate is a proof of the agreed starting log position of the epoch. We then define the validity of a replica log as follows: a replica log is valid if and only if (i) the starting log positions of all epochs are supported by a valid epoch-cert, and (ii) within each epoch e, all log positions are filled with either a valid oc or a no-op supported by a gap certificate.
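A sketch of this validity predicate over a replica log; the certificate checks are abstracted behind closures, and the slot representation is illustrative:

```rust
/// A log slot holds either an ordering certificate or a no-op backed
/// by a gap certificate (2f + 1 gap-commits).
enum Slot {
    Oc(Vec<u8>),
    NoOpWithGapCert(Vec<u8>),
}

/// A log is valid iff every epoch's starting position carries a valid
/// epoch-cert and every slot carries a valid certificate.
fn log_is_valid(
    slots: &[Slot],
    epoch_starts: &[usize],
    epoch_cert_ok: impl Fn(usize) -> bool,
    slot_cert_ok: impl Fn(&Slot) -> bool,
) -> bool {
    epoch_starts.iter().all(|&pos| epoch_cert_ok(pos))
        && slots.iter().all(|slot| slot_cert_ok(slot))
}
```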
Our view change protocol is similar to that of PBFT. The main differences are the epoch certificates and the definition of log validity. Details of the protocol can be found in §B.1.

Correctness
Here, we sketch a correctness proof for the NeoBFT protocol. Complete safety and liveness proofs can be found in §C.
The safety property we are proving is linearizability [28]. A key definition we use in our proof is committed operations: an operation is committed in a log slot if it is executed by 2f + 1 replicas with matching view-ids and log-hashes.
First, we show that within an epoch, if a request r is committed at log slot s, no other request r′ (r′ ≠ r) or no-op can be committed at s. Due to the guarantees of aom, no correct replica will execute r′ at log slot s given that some correct replica has already executed r, so r′ can never be committed at s. To show that a no-op cannot be committed, we prove by contradiction. Assume a no-op is committed at s; then some replica would have received a gap-decision with 2f + 1 valid gap-drop-messages from the leader, and 2f gap-prepares containing a drop decision from different replicas. By quorum intersection, no replica can receive 2f distinct gap-prepares with a recv at s, making drop the only possible decision. Our view change protocol also ensures that the decision will persist in all subsequent views. Moreover, 2f + 1 replicas have sent a gap-drop-message, and at least f + 1 of those replicas are non-faulty. Since they block until they receive gap-commits and the only possible outcome is drop, they will not execute r. By quorum intersection, r cannot be committed, leading to a contradiction.
Next, we show that within an epoch, if a request is committed at log slot s, all log slots before s will also be committed. A committed request at s implies that at least f + 1 correct replicas have matching logs up to s. For each log slot before s, since the same request has been processed by f + 1 correct replicas, we can apply the same reasoning as above to show that no other request or no-op can ever be committed at that slot.
Lastly, we show that our view change protocol guarantees that correct replicas agree on all committed requests and no-ops across views, and that they start each epoch in a consistent log state. The first point is easy to show given that our view change protocol merges 2f + 1 logs, using the quorum intersection principle. To prove the second point, we only need to show that for each epoch e, all correct replicas end e at the same log slot before starting e + 1. To enter epoch e + 1, a correct replica needs a valid epoch-cert for epoch e + 1: 2f + 1 distinct epoch-starts with matching log-slot-num. By quorum intersection, no other epoch-cert can exist for e + 1 with a different log-slot-num. And since correct replicas verify epoch-certs for every epoch during view changes, by induction, their logs will be in a consistent state.

EVALUATION
We implement NeoBFT and the aom library in ∼1600 lines of Rust code. The HMAC version of the aom sequencer is implemented in ∼1900 lines of P4 [13] code and compiled using the Intel P4 Studio version 9.7.0. The FPGA-based cryptographic accelerator is written in ∼1500 lines of HLS C++/Verilog code, synthesized using the Xilinx Vivado Design Suite 2020.2 [56].
We compare NeoBFT to PBFT [16], HotStuff [58], Zyzzyva [34], and MinBFT [55]. To ensure a fair comparison, all protocols are implemented in the same Rust-based framework. For MinBFT, we run the trusted USIG service in Intel SGX [51]. We also add batching support to all protocols, following the batching techniques proposed in their original work. NeoBFT does not use any batching at the protocol level. Only when using the public-key variant of aom do NeoBFT replicas buffer packets until receiving a signed message from the network. We run all protocols on four replicas, thereby tolerating one Byzantine failure, unless specified otherwise.
Testbed. Our testbed consists of nine servers and a Xilinx Alveo U50 FPGA, all connected to an Intel Tofino-based [30] switch. Replicas are deployed on machines with dual 2.90GHz Intel Xeon Gold 6226R processors (32 physical cores), 256 GB RAM, and Mellanox CX-5 EN 100 Gbps NICs. Clients run on machines with a 2.10GHz

Micro-benchmarks
We first conduct micro-benchmarks to evaluate the performance of our aom network primitive. Both the HMAC-based switch design (aom-hm) and the FPGA-accelerated public-key variant (aom-pk) are evaluated. Specifically, we generate 64-byte aom packets at line rate using the Tofino built-in packet generator. To accurately measure the latency of our design, we take an ingress and an egress switch timestamp for each aom packet. Table 2 shows the switch resource utilization of our aom in-network HMAC vector design. Table 3 summarizes the FPGA resource usage of our public-key cryptography coprocessor design. Figure 4 and Figure 5 show the latency CDFs of the two designs at different load levels. The aom group size is fixed at 4. Before being fully loaded, aom-hm attains a median latency of ∼9µs, while aom-pk achieves a median latency of ∼3µs. The longer latency of the HMAC design is due to the additional pipeline passes (12 in total) needed to generate a secure hash. Latencies of both hardware designs are highly consistent. The 99.9% latency increases by only 0.7% compared to the median for aom-hm, and 0.6% for aom-pk. At close-to-saturation load, aom-hm shows a longer latency tail due to significant queuing delays in the switch pipelines.
Figure 6 shows the maximum throughput attained by each design with varying aom group sizes. With a group size of 4, aom-hm achieves a maximum HMAC vector throughput of 77Mpps, which is around 300 million hashes per second. Its throughput, however, starts to drop when adding more receivers to the group. With 64 receivers, the throughput of aom-hm drops to 5.7Mpps, which is only 8% of that of the 4-receiver setting. aom-pk, on the other hand, attains a constant signing throughput of 1.1Mpps regardless of the group size. We note that the throughput of aom under a single sequencer is at least an order of magnitude higher than that of our NeoBFT protocol (§6.2). Our network ordering design, therefore, will not become the performance bottleneck of the system.

Latency vs. Throughput
We next evaluate the latency and throughput of NeoBFT and compare them to other BFT protocols. Our focus here is protocol-level performance, so we run an echo-RPC application with randomly generated strings as requests. We use an increasing number of closed-loop clients and measure the end-to-end latency and throughput observed by the clients. As shown in Figure 7, HMAC-based NeoBFT achieves higher maximum throughput than PBFT (2.5×), HotStuff (3.4×), and MinBFT (4.1×). More aggressive batching can further increase HotStuff's throughput to a level comparable to NeoBFT; however, its latency also increases to more than 10ms. To commit a client operation, these protocols require explicit coordination among the replicas, with each message requiring expensive cryptographic operations. MinBFT utilizes its trusted component to reduce the replication factor to 2f + 1, but does not improve the authenticator complexity. NeoBFT, on the other hand, leverages the guarantees of aom to eliminate coordination and cross-replica authentication overhead in the common case. Compared to Zyzzyva, NeoBFT still achieves 1.8× higher throughput. Moreover, when one of the replicas becomes faulty, the throughput of Zyzzyva (Zyzzyva-F) drops by more than 54%, while the throughput of NeoBFT is unaffected. When using the public-key variant of aom, NeoBFT only suffers a 60K drop in throughput, despite requiring more expensive cryptographic operations, demonstrating the efficiency of our in-network crypto design. Figure 7 also shows the bigger benefit of NeoBFT: latency. HMAC-based NeoBFT outperforms PBFT in latency by 14.68×, HotStuff by 42.28×, Zyzzyva by 8.56×, and MinBFT by 6.08×. NeoBFT commits client operations in two message delays, while the other four protocols require at least three message delays with additional authentication penalties. Using the public-key variant of aom adds 55µs to the latency of NeoBFT. However, this version of NeoBFT still outperforms all the other protocols in latency by at least 2.7×.
Tolerating a Byzantine network. As discussed in §4.2, to tolerate Byzantine sequencers, aom receivers exchange and authenticate confirm messages. This can lead to degraded throughput and latency compared to deployments that trust the network. Figure 7 shows the performance of NeoBFT when tolerating a Byzantine network. By batch processing confirm messages, NeoBFT minimizes the impact of the additional message exchanges, and is able to sustain a high throughput at the expense of higher latency. As shown in the figure, this NeoBFT variant still outperforms the other comparison protocols in both throughput and latency.

Protocol Scalability
To evaluate the scalability of NeoBFT, we gradually increase the number of NeoBFT replicas, and measure the maximum sustainable throughput. Due to the limited capacity of our own cluster, we deploy aom on Amazon EC2 in the AWS ap-east-1 region. We run up to 100 aom replicas on m5.4xlarge instances with hyper-threading disabled, and clients on t3.micro instances. As EC2 does not offer programmable switches, we implement a software version of the aom switch in Rust. Due to hardware differences, the maximum throughput numbers are lower compared to those of our cluster. As shown in Figure 8, NeoBFT using aom-pk scales to 100 replicas with only a 13% throughput drop. NeoBFT replicas process a constant number of messages per client request, regardless of the replica count. This allows NeoBFT to maintain its performance almost unchanged with more replicas. Adding replicas, however, increases the number of reply messages NeoBFT clients need to receive. NeoBFT effectively shifts the collector load to the clients, which can naturally scale. As discussed in §4.3, when using aom-hm, replicas receive g messages for each client request, where g is the number of aom-hm subgroups. When the group size increases, the throughput of NeoBFT drops as each replica processes linearly more messages.

Resilience to Network Anomalies
When aom messages are dropped in the network, NeoBFT replicas coordinate to agree on the fate of the message. To evaluate NeoBFT's resilience to network anomalies, we simulate packet drops in the network, and measure the maximum throughput of NeoBFT. As shown in Figure 9, the throughput of NeoBFT is largely unaffected when a moderate fraction of packets is dropped, since drop-notifications allow non-faulty NeoBFT replicas to efficiently recover missing messages from each other, without the expensive agreement protocol. When a higher percentage of packets is dropped (1%), NeoBFT does suffer a more observable throughput drop.
Sequencer switch failover. When the sequencer switch becomes faulty, NeoBFT performs a view change and fails over to a different sequencer (§5.5). To understand the impact of a faulty sequencer, we measure the throughput of NeoBFT during a switch failover. We ran NeoBFT at maximum throughput for 10 seconds, and then simulated a sequencer failure by dropping aom packets on the switch. The throughput of NeoBFT immediately dropped to zero. When the replicas detected the sequencer failure, they ran a view change protocol which finished in less than 200µs. They then informed the configuration service to switch to a new sequencer, which we simulated by re-initializing the sequencer switch state through the control plane. After the switch reconfiguration was done, the throughput of NeoBFT quickly resumed to its peak. Overall, sequencer failover took less than 100ms, with the majority of the delay caused by network-level updates rather than the view change protocol.

BFT Storage System Performance
Lastly, we evaluated the performance of NeoBFT when running a more complex, real-world application and compared it against the other protocols. We developed an in-memory, B-Tree-based key-value store and ran YCSB workload A with 100K records and 128-byte fields. The maximum YCSB throughput attained by each system is shown in Figure 10. NeoBFT achieved higher throughput than PBFT, HotStuff, Zyzzyva, and MinBFT when running this more complex application. The KV-store requires protocols to handle larger requests than in the previous experiments, which reduces batching efficiency for Zyzzyva, MinBFT, PBFT, and HotStuff; NeoBFT exploits its lower message complexity to attain higher performance.

RELATED WORK
BFT protocols. As discussed in §2.1, there has been a long line of work on designing practical BFT protocols [16,27,34,58]. These protocols guarantee correctness in an asynchronous network, and ensure liveness under weak synchrony. They all use a single leader node to coordinate ordering and agreement, and rely on view change (or similar) protocols to deal with faulty leaders. Byzantine Paxos [36] and DBFT [23] propose leaderless BFT designs, but require a synchronous protocol that commits in O(n) or more rounds. HoneyBadger [43] attacks the weak synchrony assumption and provides optimal asymptotic efficiency. However, it introduces O(n³) message complexity and five message delays. NeoBFT leverages the guarantees of aom to eliminate the leader and coordination overhead in the common case, leading to a bottleneck message complexity of O(1) and two message delays to commit an operation.
BFT with trusted components. Recent work has proposed leveraging trusted components to improve BFT protocols [9,20,22,37,55]. Using a local trusted component on each replica enables these protocols to reduce the replication factor to 2f + 1. However, since the trusted components are local to replicas, they still necessitate coordination among replicas to commit client operations. Moreover, many of these protocols implement trusted components in resource-constrained TPM hardware, which significantly limits their performance. NeoBFT implements its ordering service in the data center network. Relying on authenticated network ordering, the protocol avoids all coordination in normal operation. And by implementing the service on fast networking hardware, it does not become the performance bottleneck.
Network ordering. A classic line of work in distributed computing proposes stronger network models, such as atomic broadcast [12,32] and virtual synchrony [10,11], to simplify distributed system designs. These network primitives guarantee that a total order of messages is delivered to all broadcast receivers. However, atomic broadcast and virtual synchrony do not offer performance benefits to distributed systems: implementing them is equivalent to solving consensus [18]. NOPaxos [39] and Eris [38] pioneered a weaker network model in which messages are delivered in a consistent order, but reliable transmission is not guaranteed. This weaker model can be efficiently implemented using programmable switches. NOPaxos proposes an Ordered Unreliable Multicast primitive for state machine replication, while Eris designs a multi-sequenced groupcast primitive for distributed transactions. BIDL [46] uses sequencers to parallelize consensus and transaction execution in a permissioned blockchain system. However, BIDL still uses traditional BFT protocols for consensus, and its sequencer design does not improve the performance of the BFT protocol itself. The network sequencing approach has also been applied to other distributed system designs [7,8,57]. Our aom primitive was inspired by these works, but additionally provides transferable authentication, a guarantee crucial for BFT protocols.

CONCLUSION
In this work, we proposed a novel in-network authenticated ordering service, aom. We demonstrated the feasibility of this design by implementing two variants, one using HMAC and another using public key cryptography for authentication. We then co-designed a new BFT SMR protocol, NeoBFT, that eliminates cross-replica coordination and authentication in the common case. NeoBFT outperforms state-of-the-art BFT protocols on both throughput and latency metrics.
Appendices are supporting material that has not been peer-reviewed.

A ARTIFACT APPENDIX

A.1 Abstract
The NeoBFT artifact comprises two components: the source code of the main protocol, and paper-specific parts (such as artifact scripts and data) to replicate the results in this paper.

A.2 Scope
Our prototype serves as a demonstration of the following:
• The throughput of aom, in both the aom-hm and aom-pk variants, is sufficient to support replicated deployments without performance degradation.
• The co-designed replication protocol outperforms mainstream BFT protocols with lower latency and higher throughput.
• The replication protocol can support fault-tolerant deployments of realistic applications, e.g., a key-value store.

A.3 Contents
The artifact includes the implementation of libneo, the NeoBFT protocol, and all comparison BFT protocols: PBFT, Zyzzyva, HotStuff, and MinBFT. The artifact also includes P4 programs implementing the aom primitive, and an FPGA bitstream file that serves as the aom-pk accelerator.

A.4 Hosting
The source code is hosted at https://github.com/nus-sys/neobft-artifact. The repository also includes instructions for running the system and reproducing the evaluation results.

A.5 Requirements
Reproducing the aom micro-benchmarks requires a Tofino-1 switch and a Xilinx FPGA card. Experiments for reproducing Figures 7, 9, and 10 additionally require 4 servers to run replicas. The scalability evaluation, i.e., Figure 8, requires up to 100 servers to host replicas and additional servers to simulate in-network sequencing and multicast. We conducted this evaluation on AWS EC2 instances.

B ADDITIONAL PROTOCOL DETAILS

B.1 View Change Protocol
In this section, we provide the details of the view change protocol, which were omitted from the main paper.
When replica i initiates a view change, it broadcasts a ⟨view-change, view-id, v′, epoch-cert, log⟩σi to all other replicas, where view-id is its current view number, v′ is the new view, and epoch-cert contains an epoch certificate for each epoch it has started.
When leader replica L of view v′ receives 2f valid view-change messages for v′ from different replicas, it merges the logs in the view-change messages as follows: (1) It finds the largest epoch number e that is supported by an epoch-cert. (2) If it has not started epoch e yet, it finds a valid log that has started epoch e, and copies all requests and no-ops before the starting log position of epoch e to its own log. (3) From all valid logs that have started epoch e, it locates the log with the largest seq-num in epoch e, and copies all requests in epoch e from that log to its own log. (4) From any valid log that has started epoch e, it copies all no-ops in epoch e into its own log, possibly overwriting existing requests (a sketch of this merge procedure is given below). A ⟨view-start, v′, view-change-msgs⟩σL is then broadcast by the leader replica L, where view-change-msgs contains the 2f view-change messages it used to merge the log and the view-change message it would have sent for v′.

When replica i receives a view-start message with v′ higher than its current view, it checks that the view-change-msgs are properly signed by 2f + 1 different replicas, that they all contain the same next view number v′, and that their logs are valid. It then merges its log with the logs in view-change-msgs using the same procedure described above.
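The following Rust sketch mirrors steps (1)-(4). The `Entry` and `CandidateLog` types are illustrative assumptions, step (1) (identifying epoch e from the epoch-certs) is assumed done by the caller, and all signature and validity checks are omitted; this is a sketch, not the artifact's actual code.

```rust
// Sketch of the leader's log-merging rule; types are illustrative.
#[derive(Clone)]
enum Entry {
    Request { seq: u64, payload: Vec<u8> },
    NoOp,
}

struct CandidateLog {
    entries: Vec<Entry>,
    epoch_start: Option<usize>, // start position of epoch e, if started
    max_seq: u64,               // largest aom seq-num this log holds in e
}

fn merge(candidates: &[CandidateLog]) -> Vec<Entry> {
    let started: Vec<&CandidateLog> = candidates
        .iter()
        .filter(|l| l.epoch_start.is_some())
        .collect();
    // Steps (2) and (3): starting from the log with the largest seq-num in
    // epoch e covers both the pre-epoch prefix and all requests in e.
    let mut merged = started
        .iter()
        .max_by_key(|l| l.max_seq)
        .map(|l| l.entries.clone())
        .unwrap_or_default();
    // Step (4): overlay no-ops in epoch e from any valid log, possibly
    // overwriting existing requests.
    for log in &started {
        let start = log.epoch_start.unwrap();
        for (i, e) in log.entries.iter().enumerate().skip(start) {
            if matches!(e, Entry::NoOp) && i < merged.len() {
                merged[i] = Entry::NoOp;
            }
        }
    }
    merged
}
```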
If the view change does not involve an epoch switch, replica i can immediately enter the new view. Otherwise, it broadcasts an ⟨epoch-start, e′, log-slot-num⟩σi to all replicas, where e′ is the new epoch number and log-slot-num is the last log index after merging the logs during the view change. Once a replica receives 2f + 1 epoch-start messages from different replicas with e′ and log-slot-num matching its own, it can enter the new view. It also stores these epoch-start messages locally as an epoch certificate for future view changes.

B.2 State Synchronization
During normal operation, replicas execute client requests speculatively, before the requests become durable. A speculatively executed request might be overwritten due to the gap agreement or view change protocols, in which case the replica has to roll back application state and re-execute all subsequent requests in the log (see the sketch below). To reduce the frequency of rollbacks and the number of re-executions, we use a periodic synchronization protocol. The goal of the synchronization protocol is to produce a sync-point, such that all log entries up to and including the sync-point are committed. A committed log entry will never be overwritten or removed, and will be present in the log (at the same position) of all non-faulty replicas in all subsequent views.
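For concreteness, the following Rust sketch shows one way the rollback-and-replay step could be structured, using whole-state snapshots for simplicity. The `App` trait and per-entry snapshots are illustrative assumptions (a real implementation would likely use finer-grained undo records), not the artifact's actual code.

```rust
// Sketch of rollback-and-replay when a speculatively executed log entry
// is overwritten; the App trait and snapshotting are illustrative.
trait App {
    fn execute(&mut self, op: &[u8]) -> Vec<u8>;
    fn snapshot(&self) -> Vec<u8>;
    fn restore(&mut self, snap: &[u8]);
}

struct Replica<A: App> {
    app: A,
    log: Vec<Vec<u8>>,       // executed operations, in log order
    snapshots: Vec<Vec<u8>>, // app state captured before each executed entry
}

impl<A: App> Replica<A> {
    // Overwrite slot `i` (e.g. after gap agreement or a view change), roll
    // the application back, and re-execute everything from `i` onward.
    // Replies produced by re-execution are omitted in this sketch.
    fn overwrite_and_replay(&mut self, i: usize, new_op: Vec<u8>) {
        self.app.restore(&self.snapshots[i]);
        self.log[i] = new_op;
        self.snapshots.truncate(i);
        for j in i..self.log.len() {
            self.snapshots.push(self.app.snapshot());
            self.app.execute(&self.log[j]);
        }
    }
}
```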
After every K entries are added to the log (K is a configurable constant), a replica i broadcasts a ⟨sync, view-id, log-slot-num, drops⟩σi to all replicas, where log-slot-num is the latest log index that is a multiple of K, and drops contains gap certificates for all log slots that have been committed as no-ops in the current view. Once replica i receives 2f sync messages with the same log-slot-num from different replicas, then for each entry in any of the drops that has a valid gap certificate, it writes a no-op (possibly overwriting an existing request) to the corresponding log position and saves the gap certificate. It then updates its sync-point to log-slot-num. A sketch of this rule follows.
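The Rust sketch below shows the quorum-counting and no-op-writing logic; the message and certificate types are illustrative assumptions, and gap-certificate validation is elided.

```rust
use std::collections::HashMap;

// Illustrative stand-ins; a real gap certificate would carry the signed
// gap-commit messages, and its validity check is omitted here.
#[derive(Clone)]
struct GapCert;

struct SyncMsg {
    replica: u32,
    log_slot_num: u64,
    drops: Vec<(u64, GapCert)>, // (log slot, gap certificate)
}

fn on_sync(
    pending: &mut HashMap<u64, Vec<SyncMsg>>, // sync msgs by log_slot_num
    m: SyncMsg,
    f: usize,
    log: &mut Vec<Option<Vec<u8>>>, // None represents a no-op
    sync_point: &mut u64,
) {
    let slot = m.log_slot_num;
    let quorum = pending.entry(slot).or_default();
    if quorum.iter().any(|x| x.replica == m.replica) {
        return; // count each replica at most once
    }
    quorum.push(m);
    if quorum.len() >= 2 * f {
        // Write a no-op for every slot covered by a gap certificate,
        // possibly overwriting an existing request, then advance the
        // sync-point.
        for msg in quorum.iter() {
            for (s, _cert) in &msg.drops {
                if (*s as usize) < log.len() {
                    log[*s as usize] = None;
                }
            }
        }
        *sync_point = (*sync_point).max(slot);
    }
}
```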

C CORRECTNESS PROOF
This section contains complete safety and liveness proofs of the NeoBFT protocol.

C.1 Safety
The main safety property guaranteed by NeoBFT is linearizability. In this safety proof, we assume the network primitive aom provides transferable authentication, ordering, and drop detection, as specified in the main paper.
Theorem 1 (NeoBFT Safety). NeoBFT guarantees linearizability of client operations and returned results.
Before proving Theorem 1, we first define a few properties of NeoBFT replica logs and client requests.
Definition. A request is committed in a log slot if it is executed by 2f + 1 distinct replicas in that slot with matching view-id and log-hash.
Definition. A request is successful if the client receives 2f + 1 valid replies from different replicas with matching view-id, log-slot-num, log-hash, and result.
It is easy to see that a successful request implies that the request is committed.
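As a concrete illustration of these definitions from the client's side, the following Rust sketch counts matching replies from distinct replicas; the `Reply` and `ReplyKey` types are illustrative assumptions rather than the artifact's wire format, and signature verification is assumed to happen before this check.

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the client-side test for a successful request; the field set
// follows the definitions above, but the types are illustrative.
#[derive(PartialEq, Eq, Hash, Clone)]
struct ReplyKey {
    view_id: u64,
    log_slot_num: u64,
    log_hash: [u8; 32],
    result: Vec<u8>,
}

struct Reply {
    replica: u32, // sending replica; signature assumed already verified
    key: ReplyKey,
}

// Returns the matching (view-id, slot, hash, result) tuple once 2f + 1
// distinct replicas have vouched for it, i.e. the request is successful.
fn is_successful(replies: &[Reply], f: usize) -> Option<ReplyKey> {
    let mut voters: HashMap<ReplyKey, HashSet<u32>> = HashMap::new();
    for r in replies {
        voters.entry(r.key.clone()).or_default().insert(r.replica);
    }
    voters
        .into_iter()
        .find(|(_, set)| set.len() >= 2 * f + 1)
        .map(|(key, _)| key)
}
```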
Definition. A log is stable if it is a prefix of the log of every non-faulty replica in views higher than the current one.
Lemma 1 (Log Stability). Every successful request was appended onto a stable log at some non-faulty replica, and the resulting log is also stable.
To prove Lemma 1, we first prove the following set of lemmas.

Lemma 2. All non-faulty replicas that begin an epoch begin the epoch with the same log position.

Proof. We prove by induction. In the first epoch, all non-faulty replicas start with log position 0, which proves the base case. Now assume all non-faulty replicas start epoch e with the same log position. To enter the next epoch e′, a non-faulty replica needs to receive 2f + 1 epoch-start messages for e′ from distinct replicas with log-slot-num matching its own. Define these 2f + 1 epoch-start messages as an epoch certificate. By quorum intersection, no two non-faulty replicas can have epoch certificates with different log-slot-num. Therefore, all non-faulty replicas enter epoch e′ with the same log position. This proves the inductive step. □

Lemma 3. Within an epoch, if a request is committed in some log slot s, then no replica can include a no-op with a valid gap certificate in slot s in that epoch.

Proof. We prove by contradiction. Assume a replica inserts a no-op with a valid gap certificate in slot s. The replica then has received 2f + 1 gap-commits with decision drop for slot s, implying some replica has received 2f distinct gap-prepares containing a drop decision and a matching gap-decision from the leader. Since a non-faulty replica only sends a unique gap-prepare for a log slot within a view, by quorum intersection, there cannot exist 2f distinct gap-prepares containing a recv decision at slot s. Therefore, drop is the only possible commit decision for the gap agreement protocol. Moreover, a valid gap-prepare containing a drop decision implies that 2f + 1 replicas have sent a gap-drop-message. Out of those 2f + 1 replicas, at least f + 1 are non-faulty. Since non-faulty replicas block until they receive a gap-commit, and the only possible commit outcome is drop, they will not execute the request. By quorum intersection, no 2f + 1 replicas can execute the request, so the request cannot be committed. This leads to a contradiction. □

Lemma 4. For any two non-faulty replicas in the same epoch, no slot in their logs contains different requests.
Proof. Within the epoch, non-faulty replicas insert requests into their logs strictly in the order received from aom. In the absence of drop-notifications, the ordering property of aom ensures that all non-faulty replicas have an identical sequence of requests in their logs. By Lemma 2, for any epoch, all non-faulty replicas start the epoch at the same log position. Consequently, for any two non-faulty replicas, no log slot within the epoch contains different requests. A non-faulty replica may also insert a request R into its log when handling a drop-notification. To fill the gap caused by the drop-notification, the replica requires the transfer of R with the corresponding ordering certificate oc (through query-reply or gap-commit). The transferable authentication property of aom ensures that R is identical to the requests delivered by other non-faulty replicas at the same position in the aom message sequence. This case is therefore equivalent to the case in which a request, not a drop-notification, is received by the replica. A non-faulty replica may also insert a no-op into its log during the gap agreement protocol. Assume the replica inserts the no-op at the (k + i)-th log slot, where k is the starting log position of the epoch. The replica ignores the corresponding i-th request (by checking the sequence number) if it is later delivered by aom. Consequently, if the replica receives the (i + 1)-th request in the aom message sequence, it can only insert the request at log position k + i + 1. By the above argument, the replica's log from position k + i + 1 onward will not contain requests that mismatch those of other non-faulty replicas. By induction on i, no log slot within the epoch contains different requests at any two non-faulty replicas. □

Lemma 5. Any request or no-op that is committed at a log position in some view will be in the same log slot in all non-faulty replicas' logs in all subsequent views.

Proof. To enter a new view v′ from the current view v, a non-faulty replica needs to receive a view-start message which contains 2f + 1 view-change messages from distinct replicas. There are two cases. Case 1: If a request R is committed at log position p in view v, by definition, R is executed by 2f + 1 replicas at the same log position in v. Therefore, at least f + 1 non-faulty replicas have inserted R into their logs at position p. By quorum intersection, at least one of the view-change messages contains R in its log. The log merging rule in the view change protocol ensures that the replica inserts R into its log at the same log position in view v′. And by Lemma 3, no no-op can be committed at the same log position, so the replica will not overwrite the slot with a no-op during log merging. Case 2: If a no-op is committed at log position p in view v, at least 2f + 1 replicas have sent gap-commit with decision drop for log slot p. Therefore, at least f + 1 non-faulty replicas have stored the 2f gap-prepares and the matching gap-decision with decision drop. By quorum intersection, at least one of the view-change messages contains the 2f gap-prepares and the gap-decision. The log merging procedure ensures that slot p is filled with a no-op in view v′. □

We are now ready to prove the main log stability lemma.
Proof of Log Stability (Lemma 1). A successful request implies that the request is committed at some log slot s. By the definition of committed requests, at least f + 1 non-faulty replicas have matching logs up to s. And since non-faulty replicas insert log entries strictly in log order (blocking until the next log entry is resolved), all log entries before s are occupied. For any log slot s′ ≤ s, if a request R is stored in the matching logs, then by Lemma 4, no other request can be inserted into the same slot at any non-faulty replica. And by quorum intersection and our gap agreement protocol, there cannot exist a valid gap certificate for slot s′. Therefore, only R can be committed at log slot s′ in the view. Otherwise, if a no-op is stored in the matching logs, there must exist a valid gap certificate for slot s′, and by definition, the no-op is committed at s′. Lemma 5 then guarantees that the log entry at s′ (either a request or a no-op) in the matching logs will be in the same log slot in all non-faulty replicas' logs in all subsequent views. Consequently, the successful request was appended onto a stable log at at least f + 1 non-faulty replicas. Since the successful request is also committed at log slot s, by Lemma 5, the request will be in slot s in all non-faulty replicas' logs in all subsequent views. The resulting log is thus also stable. □

With Lemma 1, we are ready to prove our main safety property.
Proof of NeoBFT Safety (Theorem 1). First, observe that, by definition, a stable log only grows monotonically. Combining this observation with Lemma 1, from the client's perspective, the behavior of NeoBFT is indistinguishable from the behavior of a single, correct machine that processes requests sequentially. This implies that any execution of NeoBFT is equivalent to some serial execution of requests. Moreover, clients retry sending an operation until a request containing the operation is successful. NeoBFT applies standard at-most-once techniques to avoid executing duplicated requests. Therefore, a NeoBFT execution is equivalent to some serial execution of unique client operations.
The above argument proves serializability. When a client receives the necessary replies for a successful request R, by Lemma 1, R must have already been added to a stable log. Any successful request R′ issued after this point in real time can only be inserted after R in the stable log. Since non-faulty replicas execute requests strictly in log order, the operations issued and results returned by NeoBFT are linearizable. □

C.2 Liveness
Due to the well-known FLP result, NeoBFT cannot guarantee progress in a fully asynchronous network. We therefore only prove liveness under some weak synchrony assumptions.
Theorem 2 (Liveness). Requests sent by clients will eventually be successful if there is a sufficient amount of time during which:
• the network the replicas communicate over is fair-lossy,
• there is some bound on the relative processing speeds of replicas,
• the 2f + 1 non-faulty replicas stay up,
• there is a non-faulty replica that stays up which no non-faulty replica suspects of having failed,
• there is a non-faulty aom sequencer that stays up which no non-faulty replica suspects of having failed,
• all non-faulty replicas correctly suspect faulty nodes and aom sequencers,
• and clients' requests are eventually delivered by aom.
Proof. Since there exist a non-faulty replica and a non-faulty aom sequencer that stay up which no non-faulty replica suspects of having failed, there is a finite number of view changes during the synchrony period. Once the non-faulty replica that stays up has been elected as leader, and the non-faulty sequencer has been configured by the network, no view change with a higher view will start, as the 2f + 1 non-faulty replicas will not send the corresponding view-change messages.
Moreover, any view change that successfully starts will eventually finish. A non-faulty replica that initiates a view change keeps re-broadcasting its view-change message until the new view starts, or until the view change is supplanted by one with a higher view. Since non-faulty replicas correctly suspect faulty nodes and aom sequencers, eventually 2f + 1 non-faulty replicas will initiate the view change. As all 2f + 1 non-faulty replicas stay up, the leader for the new view, if it is non-faulty, will eventually receive the necessary 2f + 1 view-change messages to start the view. If the leader is faulty, non-faulty replicas will correctly suspect this fact, and start a higher view change which will supplant the current one.
Additionally, once a view starts, eventually all non-faulty replicas will adopt the new view and start processing requests in the view, as long as the view is not supplanted by an even higher view. If the leader is non-faulty, it will re-broadcast view-start messages until it receives acknowledgements from all replicas. If the leader is faulty, non-faulty replicas will correctly suspect this fact, and start a higher view change which will supplant the current one.
The above arguments imply that eventually, there will be a view which stays active with a non-faulty leader and a non-faulty aom sequencer. During that view, non-faulty replicas will eventually be able to resolve any drop-notification from aom: a replica receiving a drop-notification will keep resending the query message until it receives a query-reply from the leader or enough gap-commits. If the leader does not have the request, it will continually broadcast gap-find-messages to all replicas. Since 2f + 1 non-faulty replicas stay up, eventually the leader will receive either one gap-recv-message or 2f + 1 gap-drop-messages. Once the non-faulty leader starts the binary Byzantine agreement protocol with the decision, by the same line of reasoning, eventually the non-faulty replicas blocking on the drop-notification will receive the necessary gap-commits to resolve it. Therefore, the system will eventually reach a point where a view stays active with a non-faulty leader and a non-faulty aom sequencer, and non-faulty replicas only receive requests from aom. After that point, every request delivered by aom will eventually be successful. Because clients' requests will eventually be delivered by aom and the 2f + 1 non-faulty replicas stay up, clients will eventually receive the necessary replies for the requests they have sent. □

Figure 2: Folded pipeline design for generating HMAC vectors on a Tofino switch. Blue arrows denote aom packets without HMACs. Red arrows represent authenticated aom packets. Thick arrows refer to multicast.

Figure 4: Latency distribution of the HMAC variant of aom (aom-hm).

Figure 7: Comparing latency and throughput of NeoBFT and other BFT protocols. Neo-HM and Neo-PK are the HMAC and public-key versions of NeoBFT. Neo-BN uses the aom variant that tolerates a Byzantine network. Zyzzyva-F is Zyzzyva with a non-responding Byzantine replica.

Figure 8: Throughput of NeoBFT with increasing number of replicas.

Figure 9: Maximum throughput of NeoBFT under simulated packet drops.

Figure 10: Maximum YCSB throughput of each protocol.



Table 1: Comparison of NeoBFT to state-of-the-art BFT protocols. Here, bottleneck complexity denotes the number of messages the bottleneck replica needs to process; authenticator complexity shows the total number of signatures processed by all replicas.

Table 2: Switch resource usage of the aom HMAC vector switch prototype.

Table 3: FPGA resource usage of the aom public-key cryptographic coprocessor.