Optimizing Distributed Protocols with Query Rewrites

Distributed protocols such as 2PC and Paxos lie at the core of many systems in the cloud, but standard implementations do not scale. New scalable distributed protocols are developed through careful analysis and rewrites, but this process is ad hoc and error-prone. This paper presents an approach for scaling any distributed protocol by applying rule-driven rewrites, borrowing from query optimization. Distributed protocol rewrites entail a new burden: reasoning about spatiotemporal correctness. We leverage order-insensitivity and data dependency analysis to systematically identify correct coordination-free scaling opportunities. We apply this analysis to create preconditions and mechanisms for coordination-free decoupling and partitioning, two fundamental vertical and horizontal scaling techniques. Manual rule-driven applications of decoupling and partitioning improve the throughput of 2PC by 5× and Paxos by 3×, and match state-of-the-art throughput in recent work. These results point the way toward automated optimizers for distributed protocols based on correct-by-construction rewrite rules.


INTRODUCTION
Promises of better cost and scalability have driven the migration of database systems to the cloud. Yet the distributed protocols at the core of these systems, such as 2PC [46] or Paxos [43], are not designed to scale: when the number of machines grows, overheads often increase and throughput drops. As such, there has been a wealth of research on developing new, scalable distributed protocols. Unfortunately, each new design requires careful examination of prior work and new correctness proofs; the process is ad hoc and often error-prone [2,35,51,53,57,62]. Moreover, due to the heterogeneity of proposed approaches, each new insight is localized to its particular protocol and cannot easily be composed with other efforts. This paper offers an alternative approach. Instead of creating new distributed protocols from scratch, we formalize scalability optimizations into rule-driven rewrites that are correct by construction and can be applied to any distributed protocol.
To rewrite distributed protocols, we take a page from traditional SQL query optimization. Prior work has shown that distributed protocols can be expressed declaratively as sets of queries in a SQL-like language such as Dedalus [7], which we adopt here. Applying query optimization to these protocols thus seems like an appealing way forward. Doing so correctly, however, requires care, as the domain of distributed protocols requires optimizer transformations whose correctness is subtler than classical matters like the associativity and commutativity of joins. In particular, transformations to scale across machines must reason about program equivalence in the face of changes to spatiotemporal semantics like the order of data arrivals and the location of state.
We focus on applying two fundamental scaling optimizations in this paper: decoupling and partitioning, which correspond to vertical and horizontal scaling respectively. We target these two techniques because (1) they can be generalized across protocols and (2) they were recently shown by Whittaker et al. [63] to achieve state-of-the-art throughput on complex distributed protocols such as Paxos. While Whittaker's rewrites are handcrafted specifically for Paxos, our goal is to rigorously define the general preconditions and mechanics for decoupling and partitioning, so they can be used to correctly rewrite any distributed protocol.
Decoupling improves scalability by spreading logic across machines to take advantage of additional physical resources and pipeline-parallel computation. Decoupling rewrites data dependencies on a single node into messages that are sent via asynchronous channels between nodes. Without coordination, the original timing and ordering of messages cannot be guaranteed once these channels are introduced. To preserve correctness without introducing coordination, we decouple sub-components that produce the same responses regardless of message ordering or timing: these sub-components are order-insensitive. Order-insensitivity is easy to systematically identify in Dedalus thanks to its relational model: Dedalus programs are an (unordered) set of queries over (unordered) relations, so the logic for ordering (time, causality, log sequence numbers) is the exception, not the norm, and is easy to identify. By avoiding decoupling the logic that explicitly relies on order, we can decouple the remaining order-insensitive sub-components without coordination.
Partitioning improves scalability by spreading state across machines and parallelizing compute, a technique widely used in query processing [22,25]. Textbook discussions focus on partitioning data to satisfy a single query operator like join or group-by. If the next operator downstream requires a different partitioning, then data must be forwarded or "shuffled" across the network. We would like to partition data in such a way that entire sub-programs can compute on local data without reshuffling. We leverage relational techniques like functional dependency analysis to find data partitioning schemes that allow as much code as possible to work on local partitions without reshuffling between operators. This is a benefit of choosing to express distributed protocols in the relational model: functional dependencies are far easier to identify in a relational language than in a procedural language.
We demonstrate the generality of our optimizations by methodically applying rewrites to three seminal distributed protocols: voting, 2PC, and Paxos. We specifically target Paxos [59] as it is a protocol with many distributed invariants, and it is challenging to verify [31,66,67]. The throughput of the optimized voting, 2PC, and Paxos protocols scales by 2×, 5×, and 3× respectively, a scale-up factor that matches the performance of ad hoc rewrites [63] once the underlying language of each implementation is accounted for, and achieves state-of-the-art performance for Paxos.
Our correctness arguments focus on the equivalence of localized, "peephole" optimizations of dataflow graphs. Traditional protocol optimizations often make wholesale modifications to protocol logic and therefore require holistic reasoning to prove correctness. We take a different approach. Our rewrite rules modify existing programs with small local changes, each of which is proven to preserve semantics. As a result, each rewritten subprogram is provably indistinguishable to an observer (or client) from the original. We do not need to prove that holistic protocol invariants are preserved-they must be. Moreover, because rewrites are local and preserve semantics, they can be composed to produce protocols with multiple optimizations, as we demonstrate in Section 5.2.
Our local-first approach naturally has a potential cost: the space of protocol optimization is limited by design, as it treats the initial implementation as "law". It cannot distinguish between true protocol invariants and implementation artifacts, limiting the space of potential optimizations. Nonetheless, we find that, when applying our results to seminal distributed system algorithms, we easily match the results of their (manually proven) optimized implementations.
In summary, we make the following contributions: (1) We present the preconditions and mechanisms for applying multiple correct-by-construction rewrites derived from two fundamental transformations: decoupling and partitioning.
(2) We demonstrate the application of these rule-driven rewrites by manually applying them to complex distributed protocols such as Paxos.
(3) We evaluate our optimized programs and observe 2-5× improvements in throughput across protocols, with state-of-the-art throughput for Paxos, validating the role of correct-by-construction rewrites for distributed protocols.

BACKGROUND
Our contributions begin with the program rewriting rules in Section 3. Naturally, the correctness of those rules depends on the details of the language we are rewriting, Dedalus. Hence in this section we pause to review the syntax and semantics of Dedalus, as well as additional terminology we will use in subsequent discussion.
Dedalus is a spatiotemporal logic for distributed systems [7]. As we will see in Section 2.3, Dedalus captures specifications for the state, computation and messages of a set of distributed nodes over time. Each node (a.k.a. machine, thread) has its own explicit "clock" that marks out local time sequentially. Dedalus (and hence our work here) assumes a standard asynchronous model in which messages between correct nodes can be arbitrarily delayed and reordered, but must be delivered after some finite (though unbounded) amount of time [24]. Dedalus is a dialect of Datalog¬, which is itself a SQL-like declarative logic language that supports familiar constructs like joins, selection, and projection, with additional support for recursion, aggregation (akin to GROUP BY in SQL), and negation (NOT IN). Unlike SQL, Datalog¬ has set semantics.

Running example
As a running example, we focus on a verifiably replicated key-value store with hash-conflict detection, inspired by [56]. We use this example to explain the core concepts of Dedalus and to illustrate in Sections 3 and 4 how our transformations can be applied. In Section 5 we turn our attention to more complex and realistic examples, including Paxos and 2PC. Figure 1 provides a high-level diagram of the example; we explain the corresponding Dedalus code (Listings 1 and 2) in the next subsection.
The running example consists of a leader node and multiple storage nodes and allows clients to write to storage nodes, with the ability to detect concurrent writes. The leader node cryptographically signs each client message and broadcasts both the message and signature to each storage node. Each storage node then stores the message and the hash of the message in a local table if the signature is valid. The storage nodes also calculate the number of unique existing messages in the table whose hash collides with the hash of the message. The storage nodes then sign the original message and respond to the leader node. Upon collecting a response from each storage node, if the number of hash collisions is consistent across responses, the leader creates a certificate of all the responses and replies to the client. If any two storage nodes report differing numbers of hash collisions, the leader notifies the client of the inconsistency. We use this simple protocol for illustration, and present more complete protocols-2PC and Paxos-in Section 5.

Datalog¬
We now introduce the necessary Datalog¬ terminology, copying code snippets from Listings 1 and 2 to introduce key concepts. Consider the rule on Line 3 of Listing 2, which, reconstructed from the discussion that follows, reads:

collisions(val2,hashed,l,t) :- toStorage(val1,leaderSig,l,t), hash(val1,hashed), hashset(hashed,val2,l,t)

In this example, the head literal is collisions, and the body literals are toStorage, hash, and hashset. Each body literal can be a (possibly negated) relation consisting of multiple attributes, or a boolean expression; the head literal must be a relation. For example, hashset is a relation with four attributes representing the hash, message value, location, and time, in that order. Each attribute must be bound to a constant or variable; attributes in the head literal can also be bound to aggregation functions. In the example above, the attribute representing the message value in hashset is bound to the variable val2. Positive literals in the body of the rule are joined together; negative literals are anti-joined (SQL's NOT IN). Attributes bound to the same variable form an equality predicate: in the rule above, the first attribute of toStorage must equal the first attribute of hash, since both are bound to val1; this specifies an equijoin of those two relations. Two positive literals in the same body that share no common variables form a cross-product. Multiple rules may have the same head relation; the head relation is then defined as the disjunction (SQL UNION) of the rule bodies.
Note how library functions like hash are simply modeled as infinite relations of the form (input, output). Because these are infinite relations, they can only be used in a rule body if the input variables are bound to another attribute; this corresponds to "lazily evaluating" the function only for that attribute's finite set of values. For example, the relation hash contains the fact (x, y) if and only if hash(x) equals y.
Relations R are populated with facts f, which are tuples of values, one for each attribute of R. We will use the syntax f(a) to project f to the value of attribute a. Relations with facts stored prior to execution are traditionally called extensional relations, and the set of extensional relations is called the EDB. Derived relations, defined in the heads of rules, are traditionally called intensional relations, and the set of them is called the IDB. Boolean operators and library functions like hash have pre-defined content, hence they are (infinite) EDB relations.
Datalog¬ also supports negation and aggregation. An example of aggregation is seen in Listing 2 Line 4, which counts the number of hash collisions with the count aggregation:

numCollisions(count<val>,hashed,l,t) :- collisions(val,hashed,l,t)

In this syntax, attributes that appear outside of aggregate functions form the GROUP BY list; attributes inside the functions are aggregated. In order to compute aggregation in any rule r, we must first compute the full content of all relations in the body of r. Negation works similarly: if we have a literal !r(x) in the body, we can only conclude that a given x is absent from r after we are sure we have computed the full contents of r. We refer the reader to [1,48] for further reading on aggregation and negation.

Dedalus
Dedalus programs are legal Datalog¬ programs, constrained to adhere to three additional syntactic rules.
(1) Space and Time in Schema: All IDB relations must contain two attributes at their far right: location l and time t. Together, these attributes model where and when a fact exists in the system. For example, in the rule on Line 3 discussed above, a toStorage message m and signature s that arrives at time t at a node with location l is represented as a fact toStorage(m, s, l, t).
(2) Matching Space-Time Variables in Body: The location and time attributes in all body literals must be bound to the same variables l and t, respectively. This models the physical property that two facts can be joined only if they exist at the same time and location. In Line 3, a toStorage fact that appears on node l at time t can only match with hashset facts that are also on l at time t.
We model library functions like hash as relations that are known (replicated) across all nodes l and unchanging across all timesteps t. Hence we elide l and t from function and expression literals as a matter of syntactic sugar, and assume they can join with other literals at all locations and times.
(3) Space and Time Constraints in Head: The location and time variables in the head of rules must obey certain syntactic constraints, which ensure that the "derived" locations and times correspond to physical reality. These constraints differ across three types of rules. Synchronous ("deductive" [7]) rules are captured by having the same time variable in the head literal as in the body literals. Having these derivations assigned to the same timestep t is only physically possible on a single node, so the location in the head of a synchronous rule must match the body as well.
Sequential ("inductive" [7]) rules are captured by having the head literal's time be the successor (t+1) of the body literals' time t. Again, sequentiality can only be guaranteed physically on a single node in an asynchronous system, so the location of the head in a sequential rule must match the body. Asynchronous rules capture message passing between nodes, by having different time and location variables in the head than the body. In an asynchronous system, messages are delivered at an arbitrary time in the future. We discuss how this is modeled below.
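Before turning to asynchronous delivery, here is a minimal sketch of a sequential rule: the canonical persistence idiom (our illustration; the actual persistence rules of Listings 1 and 2 are not shown in this paper excerpt):

hashset(hashed, val, l, t+1) :- hashset(hashed, val, l, t)

This rule re-derives every hashset fact on node l at each successive timestep, so the relation's contents survive across local timesteps; this is how Dedalus models stored state.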
In an asynchronous rule r, the location attributes of the head and body relations of r are bound to different variables; a different location in the head of r indicates the arrival of the fact on a new node. Asynchronous rules are constrained to capture non-deterministic delay by including a body literal for the built-in delay relation (a.k.a. choose [7], chosen [4]), a non-deterministic function that independently maps each head fact to an arrival time. The logical formalism of the delay function is discussed in [4]; for our purposes it is sufficient to know that delay is constrained to reflect Lamport's "happens-before" relation for each fact. That is, a fact sent at time t on node l arrives at time t' on node l', where t < t'. We focus on Listing 2, Line 5 from our running example.
fromStorage(l,sig,val,collCnt,l',t') :- toStorage(val,leaderSig,l,t), hash(val,hashed), numCollisions(collCnt,hashed,l,t), sign(val,sig), leader(l'), delay((sig,val,collCnt,l,t,l'),t')

This is an asynchronous rule where a storage node l sends the count of hash collisions for each distinct storage request back to the leader l'. Note the l' and t' in the head literal: they are derived from the body literals leader (an EDB relation storing the leader's address) and the built-in delay.
Note also how the first attribute of delay (the function "input") is a tuple of variables that, together, distinguish each individual head fact. This allows delay to choose a different t' for every head fact [4]. The l in the head literal represents the storage node's address and is used by the leader to count the number of votes; it is unrelated to asynchrony.

Further terminology
We introduce some additional terminology to capture the rewrites we wish to perform on Dedalus programs. Our discussion so far has been at the level of rules; we will also need to reason about individual facts. A proof tree [1] can be constructed for each IDB fact f, where f lies at the root of the tree, each leaf is an EDB or input fact, and each internal node is an IDB fact derived from its children via a single rule. Consider, for example, a proof tree for one fact in toStorage.
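A sketch of such a tree, assuming Listing 1 derives signed facts from client in facts (via the EDB sign relation) and broadcasts them asynchronously to storage nodes listed in a hypothetical storageNodes EDB relation:

toStorage(val, sig, l2, t2)               root: IDB fact arriving at a storage node
├─ signed(val, sig, l1, t1)               internal node: IDB fact derived at the leader
│  ├─ in(val, l1, t1)                     leaf: client input fact
│  └─ sign(val, sig)                      leaf: (infinite) EDB function relation
├─ storageNodes(l2)                       leaf: EDB relation of storage node addresses
└─ delay((val, sig, l1, t1, l2), t2)      leaf: built-in delay relation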

Correctness
This paper transforms single-node Dedalus components into "equivalent" multi-component, multi-node Dedalus programs; the transformations can be composed to scale entire distributed protocols.
For equivalence, we want a definition that satisfies any client (or observer) of the input/output channels of the original program. To this end we employ equivalence of concurrent histories as defined for linearizability [33], the gold standard in distributed systems. We assume that a history H can be constructed from any run of a given Dedalus program P.
Linearizability traditionally expects every program to include a specification that defines what histories are "legal". We make no such assumption; instead, we consider any possible history generated by the unoptimized program P to define the specification. As such, the optimized program P' is linearizable if any run of P' generates the same output facts with the same timestamps as some run of P.
Our rewrites are safe over protocols that assume the following fault model: an asynchronous network (messages between correct nodes will eventually be delivered) where up to f nodes can suffer from general omission failures [52] (they may fail to send or receive some messages). After optimizing, one original node n may be replaced by multiple nodes n1, n2, ...; the failure of any node ni corresponds to a partial failure of the original node n, which is equivalent to the failure of n under general omission.
Full proofs, preconditions, and mechanisms for the rewrites described in Sections 3 and 4 can be found in Appendix A.

DECOUPLING
Decoupling partitions code; it takes a Dedalus component running on a single node, and breaks it into multiple components that can run in parallel across many nodes. Decoupling can be used to alleviate single-node bottlenecks by scaling up available resources. Decoupling can also introduce pipeline parallelism: if one rule produces facts in its head that another rule consumes in its body, decoupling those rules across two components can allow the producer and consumer to run in parallel.
Because Dedalus is a language of unordered rules, decoupling a component is syntactically easy: we simply partition the component's ruleset into multiple subsets, and assign each subset to a different node. The result is syntactically legal, but the correctness story is not quite that simple. To decouple and retain the original program semantics, we must address classic distributed systems challenges: how to get the right data to the right nodes (space), and how to ensure that introducing asynchronous messaging between nodes does not affect correctness (time).
In this section we step through a progression of decoupling scenarios, and introduce analyses and rewrites that provably address our concerns regarding space and time. Throughout, our goal is to avoid introducing any coordination, i.e., extra messages beyond the data passed between rules in the original program.
General Construction for Decoupling: In all our scenarios we will consider a component C at network location addr, consisting of a set of rules R. We will, without loss of generality, decouple C into two components: C1 with ruleset R1, which stays at location addr, and C2 with ruleset R2, which is placed at a new location addr2. The rulesets of the two new components partition the original ruleset: R1 ∩ R2 = ∅ and R1 ∪ R2 ⊇ R. Note that we may add new rules during decoupling to achieve equivalence.

Mutually Independent Decoupling
Intuitively, if the component C1 never communicates with C2, then running them on two separate nodes should not change program semantics. We simply need to ensure that inputs from other components are sent to addr or addr2 appropriately.
Consider the component defined in Listing 1. There is no dataflow between the relations in Lines 1 and 2 and the relations in the remainder of the rules in the component. One possible decoupling would place Lines 1 and 2 on C1, the remainder of Listing 1 on C2, and reroute fromStorage messages from C1 to C2, as seen in Figure 2.
We now define a precondition that determines when this rewrite can be applied, and the rewrite itself.

Precondition: C1 and C2 are mutually independent; no relation in the head of a rule of either component is referenced in the body of a rule of the other.

Rewrite: Redirection. We simply add a "redirection" EDB relation to the body of each rule whose head is referenced in C2, which maps addr to addr2, and any other address to itself. For our example above, we need to ensure that fromStorage is sent to addr2. To enforce this we rewrite Line 5 of Listing 2 as follows (note the variable l'' in the head, and forward in the body):

fromStorage(l,sig,val,collCnt,l'',t') :- toStorage(val,leaderSig,l,t), hash(val,hashed), numCollisions(collCnt,hashed,l,t), sign(val,sig), leader(l'), forward(l',l''), delay((l,sig,val,collCnt,l,t,l''),t')
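For illustration, the forward relation is just an EDB table of address pairs; a sketch of its contents (concrete addresses are hypothetical):

forward(addr, addr2)        inputs of the decoupled rules are rerouted to addr2
forward(addrX, addrX)       every other address maps to itself

Because forward joins on the original destination variable l', only each message's destination changes; the message contents are untouched.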

Monotonic Decoupling
Now consider a scenario in which C1 and C2 are not mutually independent. If C2 depends on C1, decoupling changes the dataflow from C1 to C2 to traverse asynchronous channels. After decoupling, facts that co-occurred in C may be spread across time in C2; similarly, two facts that were ordered or timed in a particular way in C may be ordered or timed differently in C2. Without coordination, very little can be guaranteed about the behavior of a component after the ordering or timing of facts is modified. Fortunately, the CALM Theorem [32] tells us that monotonic components eventually produce the same output independent of any network delays, including changes to co-occurrence, ordering, or timing of inputs. A component C2 is monotonic if increasing its input set from I to I' ⊇ I implies that the output set C2(I') ⊇ C2(I); in other words, each referenced relation and output of C2 will monotonically accumulate a growing set of facts as inputs are received over time, independent of the order in which they were received. The CALM Theorem ensures that if C2 is shown to be monotonic, then we can safely decouple C1 and C2 without any coordination.
In our running example, the leader (Listing 1) is responsible for both creating certificates from a set of signatures (Lines 5 to 7) and checking for inconsistent ACKs (Line 8). Since ACKs are persisted, once a pair of ACKs is inconsistent, they will always be inconsistent; Line 8 is monotonic. Monotonic decoupling of Line 8 allows us to offload inconsistency-checking from the single leader to the decoupled "proxy" highlighted in yellow in Figure 3, as sketched below.
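While Listing 1's Line 8 is not reproduced in this excerpt, the description above suggests a rule of roughly the following shape (relation and variable names are our guesses), joining persisted acks that report different collision counts for the same value:

inconsistent(val, l, t) :- acks(src1, cnt1, val, l, t), acks(src2, cnt2, val, l, t), cnt1 != cnt2

Such a rule is monotonic: new acks facts can only grow the set of inconsistent facts, never retract one.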
Precondition: C1 is independent of C2, and C2 is monotonic.
Monotonicity of a Datalog¬ (hence Dedalus) component is undecidable [40], but effective conservative tests for monotonicity are well known. A simple sufficient condition for monotonicity is to ensure that (a) C2's input relations are persisted, and (b) C2's rules do not contain negation or aggregation. In Appendix A.2 we relax each of these checks to be more permissive.
Rewrite: Redirection With Persistence. Note that in this case we may have relations R that are outputs of C1 and inputs to C2. We use the same rewrite as in the previous section, with one addition: we add a persistence rule to C2 for each relation R that is an output of C1 and an input of C2, guaranteeing that all inputs of C2 remain persisted.
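For instance, since fromStorage is an output of C1 consumed by C2 in our running example, the rewrite would add a persistence rule of the following shape to C2 (a sketch, with attributes ordered as in Listing 2's head literal):

fromStorage(src, sig, val, collCnt, l, t+1) :- fromStorage(src, sig, val, collCnt, l, t)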
The alert reader may notice performance concerns. First, C1 may redundantly resend persistently-derived facts to C2 each tick, even though C2 is persistently storing them anyway via the rewrite. Second, C2 is required to persist facts indefinitely, potentially long after they are needed. Solutions to this problem were explored in prior work [17] and can be incorporated here as well without affecting semantics.

Functional Decoupling
Consider a component that behaves like a "map" operator for a pure function F on individual facts: for each fact f it receives as input, it outputs F(f). Surely these should be easy to decouple! Map operators are monotonic (their output set grows with their input set), but they are also independent per fact: each output is determined only by its corresponding input, and in particular is not affected by previous inputs. This property allows us to forgo the persistence rules we introduced for more general monotonic decoupling; we refer to this special case of monotonic decoupling as functional decoupling.
Consider again Lines 1 and 2 in Listing 1. Note that Line 1 works like a function on one input: each fact in in results in an independent signed fact in signed. Hence we can decouple further, placing Line 1 on one node and Line 2 on another, forwarding signed values to toStorage. Intuitively, this decoupling does not change program semantics because Line 2 simply sends messages, regardless of which messages have come before: it behaves like a pure function.
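For reference, Lines 1 and 2 of Listing 1 plausibly take the following shape (a sketch; the storageNodes relation and exact attribute orders are our assumptions). Line 1 is a synchronous rule applying the sign function to each client input; Line 2 asynchronously broadcasts each signed fact to every storage node:

signed(val, sig, l, t) :- in(val, l, t), sign(val, sig)
toStorage(val, sig, l', t') :- signed(val, sig, l, t), storageNodes(l'), delay((val, sig, l, t, l'), t')

Each body contains at most one IDB (or input) relation and no aggregation or negation, which is what makes these rules "functional" in the sense defined next.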
Precondition: C1 is independent of C2, and C2 is functional; that is, (1) it does not contain aggregation or negation, and (2) each rule body in C2 has at most one IDB relation.
Rewrite: Redirection. We reuse the rewrite from Section 3.1.
As a side note, recall that persisted relations in Dedalus are by definition IDB relations. Hence Precondition (2) prevents C2 from joining current inputs (an IDB relation) with previously persisted data (another IDB relation)! In effect, persistence rules are irrelevant to the output of a functional component, rendering functional components effectively "stateless".

PARTITIONING
Decoupling is the distribution of logic across nodes; partitioning (or "sharding") is the distribution of data. By using a relational language like Dedalus, we can scale protocols using a variety of techniques that query optimizers use to maximize partitioning without excessive "repartitioning" (a.k.a. "shuffling") of data at runtime.
Unlike decoupling, which introduces new components, partitioning introduces additional nodes on which to run instances of each component. Therefore, each fact may be rerouted to any of the many nodes, depending on the partitioning scheme. Because each rule still executes locally on each node, we must reason about changing the location of facts.
We first need to define partitioning schemes, and what it means for a partitioning to be correct for a set of rules. Much of this can be borrowed from recent theoretical literature [8,27,28,55]. A partitioning scheme is described by a distribution policy D(f) that outputs some node address addr_i for any fact f. A partitioning preserves the semantics of the rules in a component if it is parallel disjoint correct [55]. Intuitively, this property says that the body facts that need to be colocated remain colocated after partitioning. We adapt the parallel disjoint correctness definitions to the context of Dedalus as follows:

Definition 4.1. A distribution policy D over component C is parallel disjoint correct if for any fact f of C, and any two facts f1, f2 in the proof tree of f, D(f1) = D(f2).

Ideally we can find a single distribution policy that is parallel disjoint correct over the component in question. To do so, we need to partition each relation based on the set of attributes used for joining or grouping the relation in the component's rules. Such distribution policies are said to satisfy the co-hashing constraint (Section 4.1). Unfortunately, it is common for a single relation to be referenced in two rules with different join or grouping attributes. In some cases, dependency analysis can still find a distribution policy that will be correct (Section 4.2). If no parallel disjoint correct distribution policy can be found, we can resort to partial partitioning (Section 4.3), which replicates facts across multiple nodes.
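For example, a hypothetical distribution policy over k partitions might route each fact by hashing the attribute bound to val:

D(f) = addr_i, where i = h(f(val)) mod k, for some hash function h

Parallel disjoint correctness then demands that any two facts appearing together in some proof tree are assigned the same address under D.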
To discuss partitioning rewrites on generic Dedalus programs, we consider without loss of generality a component C with a set of rules R at network location addr. We will partition the data at addr across a set of new locations addr1, addr2, etc., each executing the same rules R.

Co-hashing
We begin with co-hashing [28,55], a well-studied constraint that avoids repartitioning data. Our goal is to co-locate facts that need to be combined because they (a) share a join key, (b) share a group key, or (c) share an antijoin key.
Consider two relations R1 and R2 that appear in the body of a rule r, with matching variables bound to attributes A in R1 and corresponding attributes B in R2. Henceforth we will say that R1 and R2 "share keys" on attributes A and B. Co-hashing states that if R1 and R2 share keys on attributes A and B, then all facts from R1 and R2 with the same values for A and B must be routed to the same partition.
Note that even if co-hashing is satisfied for individual rules, data might still need to be repartitioned between rules, because a relation R might share keys with another relation on attributes A in one rule and A' in another. To avoid repartitioning, we would like the distribution policy to partition consistently with co-hashing in every rule of a component.
Consider Line 8 of Listing 1, assuming it has already been decoupled. Inconsistencies between ACKs are detected on a per-value basis and can be partitioned over the attribute bound to the variable val; this is evidenced by the fact that the relation acks is always joined with other IDB relations using the same attribute (bound to val). Line 2 and Listing 2 Line 5 are similarly partitionable by value, as seen in Figure 5.
Formally, a distribution policy D partitions relation R by attribute list A if, for any pair of facts f1, f2 of R, f1(A) = f2(A) implies D(f1) = D(f2): facts are distributed according to their partitioning attributes. D partitions consistently with co-hashing if, for every rule r of C and every pair of referenced relations R1, R2 that share keys on attribute lists A1 and A2 respectively, any pair of facts f1 of R1 and f2 of R2 with f1(A1) = f2(A2) satisfies D(f1) = D(f2). Facts will then be successfully joined, aggregated, or negated after partitioning, because they are sent to the same locations.
Precondition: There exists a distribution policy D for the relations referenced by component C that partitions consistently with co-hashing.
We can discover candidate distribution policies through a static analysis of the join and grouping attributes in every rule of C.
Rewrite: Redirection With Partitioning. We are given a distribution policy D from the precondition. For any rule whose head relation is referenced in C, we modify the "redirection" relation such that a message fact f originally sent to C at addr is instead sent to the partition of C at D(f).
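Sketching the rewrite on Listing 2's Line 5, the forward relation from Section 3.1 is replaced by a hypothetical partition EDB relation that maps the routing key and the original address to a partition's address:

fromStorage(l, sig, val, collCnt, l'', t') :- toStorage(val, leaderSig, l, t), hash(val, hashed), numCollisions(collCnt, hashed, l, t), sign(val, sig), leader(l'), partition(val, l', l''), delay((l, sig, val, collCnt, l, t, l''), t')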

Dependencies
By analyzing Dedalus rules, we can identify dependencies between attributes that (1) strengthen partitioning, by showing that partitioning on one attribute can imply partitioning on another, and (2) loosen the co-hashing constraint.
For example, consider a relation R that contains both an original string attribute Str and its uppercased value in attribute UpStr. There is a functional dependency (FD) Str → UpStr, so any policy that partitions R by UpStr also partitions R by Str: facts that agree on Str must agree on UpStr and are therefore assigned to the same node. We return to the running example to see how co-dependencies (CDs), which relate attributes across relations, and FDs can be combined to enable coordination-free partitioning where co-hashing forbade it. Listing 2 cannot be partitioned with co-hashing because toStorage does not share keys with hashset in Line 3. No distribution policy can satisfy the co-hashing constraint if there exist two relations in the same rule that do not share keys. However, we know that the hash is a function of the value; there is an FD hash.1 → hash.2.
Hence partitioning on hash.2 implies partitioning on hash.1. The first attributes of toStorage and hashset are joined through the attributes of the hash relation in all rules, forming a CD. Let the first attributes of toStorage and hashset, representing a value and a hash, be v and h respectively: a fact f_t in toStorage can only join with a fact f_h in hashset if hash(f_t(v)) equals f_h(h). This reasoning can be repeatedly applied to partition all relations by the attributes corresponding to the repeated variable hashed, as seen in Figure 6.
Precondition: There exists a distribution policy D for the relations referenced in C that partitions consistently with the CDs of C.
Assume we know all CDs over attribute sets A1, A2 of relations R1, R2. A distribution policy D partitions consistently with CDs if, for any pair of facts f1, f2 over referenced relations R1, R2 whose attributes A1, A2 are related by a CD, that relationship between f1(A1) and f2(A2) implies D(f1) = D(f2). We describe the mechanism for systematically finding FDs and CDs in Appendix B.2.1.
Rewrite: Identical to Redirection with Partitioning.

Partial partitioning
It is perhaps surprising, but sometimes additional coordination can actually help distributed protocols (like Paxos) scale.
There exist Dedalus components that cannot be partitioned even with dependency analysis. If the non-partitionable relations are rarely written to, it may be beneficial to replicate the facts in those relations across nodes so each node holds a local copy. This can support multiple local reads in parallel, at the expense of occasional writes that require coordination.
We divide the component C into C1 and C2, where the relations referenced in C2 can be partitioned using the techniques in prior sections, but the relations referenced in C1 cannot. In order to fully partition C, facts in the relations referenced in C1 must be replicated to all nodes and kept consistent, so that each node can perform local processing. To replicate those facts, inputs that modify the replicated relations are broadcast to all nodes.
Coordination is required in order to maintain consistency between nodes with replicated facts. Each node orders replicated inputs by buffering other inputs when replicated facts f arrive, only flushing the buffer after the node is sure that all other nodes have also received f. Knowledge of whether a node has received f can be enforced through a distributed commit or consensus mechanism.
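As a sketch, if ballot is a replicated relation, the input stream that updates it can be broadcast to every partition via a hypothetical partitions EDB relation listing all partition addresses:

ballot(b, l', t') :- ballotIn(b, l, t), partitions(l'), delay((b, l, t, l'), t')

The buffering logic then holds back other inputs at each partition until every partition is known to have received the broadcast, e.g., via the commit mechanism above.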
Precondition: C1 is independent of C2, and both behave like state machines.
We define "state machines" in Appendix A.4 and the rewrites for partial partitioning in Appendix B.3.

EVALUATION
We will refer to our approach of manually modifying distributed protocols with the mechanisms described in this paper as rule-driven rewrites, and the traditional approach of modifying distributed protocols and proving the correctness of the optimized protocol as ad hoc rewrites.
In this section we address the following questions: (1) How can rule-driven rewrites be applied to foundational distributed protocols, and how well do the optimized protocols scale? (Section 5.2) (2) Which of the ad hoc rewrites can be reproduced via the application of (one or more) rules, and which cannot? (Section 5.3) (3) What is the effect of the individual rule-driven rewrites on throughput? (Section 5.4)

Experimental setup
All protocols are implemented as Dedalus programs and compiled to Hydroflow [54], a Rust dataflow runtime for distributed systems. We deploy all protocols on GCP using n2-standard-4 machines with 4 vCPUs, 16 GB RAM, and 10 Gbps network bandwidth, with one machine per Dedalus node.
We measure throughput and latency over one-minute runs, following a 30-second warmup period. Each client sends 16-byte commands in a closed loop. The ping time between machines is 0.22 ms. We assume the client is outside the scope of our rewrites; any rewrite that requires modifying the client cannot be applied.

Rewrites and scaling
We manually apply rule-driven rewrites to scale three fundamental distributed protocols: voting, 2PC, and Paxos. We will refer to our unoptimized implementations as BaseVoting, Base2PC, and BasePaxos, and the rewritten implementations as ScalableVoting, Scalable2PC, and ScalablePaxos. In general, we will prepend the word "Base" to any unoptimized implementation, "Scalable" to any implementation created by applying rule-driven rewrites, and "Dedalus" to any implementation in Dedalus. We measure the performance of each configuration with an increasing set of clients until throughput saturates, averaging across 3 runs, with standard deviations of throughput measurements shown in shaded regions. Since the minimum configuration of Paxos (with f = 1) requires 3 acceptors, we also test voting and 2PC with 3 participants.
For decoupled-and-partitioned implementations, we measure scalability by changing the number of partitions for partitionable components, as seen in Figure 7. Decoupling contributes to the throughput differences between the unoptimized implementation and the 1-partition configuration. Partitioning contributes to the differences between the 1, 3, and 5 partition configurations.
These experimental configurations demonstrate the scalability of the rewritten protocols. They do not represent the most cost-effective configurations, nor the configurations that maximize throughput. We manually applied rewrites on the critical path, selecting rewrites with low overhead, where we suspected the protocols might be bottlenecked. Across the protocols we tested, these bottlenecks often occurred where the protocol (1) broadcasts messages, (2) collects messages, or (3) logs to disk. These bottlenecks can usually be decoupled from the original node, and because messages are often independent of one another, the decoupled nodes can then be partitioned such that each node handles a subset of messages. The process of identifying bottlenecks, applying suitable rewrites, and finding optimal configurations may eventually be automated.
Voting. Client payloads arrive at the leader, which broadcasts payloads to the participants, collects votes from the participants, and responds to the client once all participants have voted. Multiple rounds of voting can occur concurrently. BaseVoting is implemented with 4 machines, 1 leader and 3 participants, achieving a maximum throughput of 100,000 commands/s, bottlenecking at the leader.
We created ScalableVoting from BaseVoting through Mutually Independent Decoupling, Functional Decoupling, and Partitioning with Co-hashing. Broadcasters broadcast votes for the leader; they are decoupled from the leader through functional decoupling. Collectors collect and count votes for the leader; they are decoupled from the leader through mutually independent decoupling. The remaining "leader" component only relays commands to broadcasters. All components except the leader are partitioned with co-hashing. The leader cannot be partitioned, since that would require modifying the client to know how to reach one of many leader partitions. With 1 leader, 5 broadcasters, 5 partitions for each of the 3 participants, and 5 collectors, the maximum configuration for ScalableVoting totals 26 machines, achieving a maximum throughput of 250,000 commands/s-a 2× improvement over the baseline.
2PC (with Presumed Abort). The coordinator receives client payloads and broadcasts voteReq to participants. Participants log and flush to disk, then reply with votes. The coordinator collects votes, logs and flushes to disk, then broadcasts commit to participants. Participants log and flush to disk, then reply with acks. The coordinator then logs and replies to the client. Multiple rounds of 2PC can occur concurrently. Base2PC is implemented with 4 machines, 1 coordinator and 3 participants, achieving a maximum throughput of 30,000 commands/s, bottlenecking at the coordinator.
We created Scalable2PC from Base2PC similarly, through Mutually Independent Decoupling, Functional Decoupling, and Partitioning with Co-hashing. Vote Requesters are functionally decoupled from coordinators: they broadcast voteReq to participants. Committers and Enders are decoupled from coordinators through mutually independent decoupling. Committers collect votes, log and flush commits, then broadcast commit to participants. Enders collect acks, log, and respond to the client. The remaining "coordinator" component relays commands to vote requesters. Each participant is mutually independently decoupled into Voters and Ackers. Participant Voters log, flush, then send votes; Participant Ackers log, flush, then send acks. All components (except the coordinator) can be partitioned with co-hashing. With 1 coordinator, 5 vote requesters, 5 ackers and 5 voters for each of the 3 participants, 5 committers, and 5 enders, the maximum configuration of Scalable2PC totals 46 machines, achieving a maximum throughput of 160,000 commands/s-a 5× improvement.
Paxos. Paxos solves consensus while tolerating up to f failures. Paxos consists of f + 1 proposers and 2f + 1 acceptors. Each proposer has a unique, dynamic ballot number; the proposer with the highest ballot number is the leader. The leader receives client payloads, assigns each payload a sequence number, and broadcasts a p2a message containing the payload, sequence number, and its ballot to the acceptors. Each acceptor stores the highest ballot it has received and rejects or accepts payloads into its log based on whether its local ballot is less than or equal to the leader's. The acceptor then replies to the leader via a p2b message that includes the acceptor's highest ballot. If this ballot is higher than the leader's ballot, the leader is preempted. Otherwise, the acceptor has accepted the payload, and when f + 1 acceptors accept, the payload is committed. The leader relays committed payloads to the replicas, which execute the payload command and notify the clients.
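In Dedalus, the heart of the leader's broadcast can be sketched as a single asynchronous rule (relation names here are our assumptions, not drawn from an actual listing):

p2a(payload, slot, b, l', t') :- clientIn(payload, l, t), nextSlot(slot, l, t), currentBallot(b, l, t), acceptors(l'), delay((payload, slot, b, l, t, l'), t')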
BasePaxos is implemented with 8 machines-2 proposers, 3 acceptors, and 3 replicas (matching the configuration in Section 5.3)-tolerating f = 1 failures, achieving a maximum throughput of 50,000 commands/s, bottlenecking at the proposer. We created ScalablePaxos from BasePaxos through Mutually Independent Decoupling, (Asymmetric) Monotonic Decoupling, Functional Decoupling, Partitioning with Co-hashing, and Partial Partitioning with Sealing. P2a proxy leaders are functionally decoupled from proposers and broadcast p2a messages. P2b proxy leaders collect p2b messages and broadcast committed payloads to the replicas; they are created through asymmetric monotonic decoupling, since the collection of p2b messages is monotonic but proposers must be notified when the messages contain a higher ballot. Both can be partitioned on sequence numbers with co-hashing. Acceptors are partially partitioned with sealing on sequence numbers, replicating the highest ballot across partitions, necessitating the creation of a coordinator for each acceptor. With 2 proposers, 3 p2a proxy leaders and 3 p2b proxy leaders for each of the 2 proposers, 1 coordinator and 3 partitions for each of the 3 acceptors, and 3 replicas, totalling 29 machines, ScalablePaxos achieves a maximum throughput of 150,000 commands/s-a 3× improvement, bottlenecking at the proposer.
Across the protocols, the additional latency overhead from decoupling is negligible.
Together, these experiments demonstrate that rule-driven rewrites can be applied to scale a variety of distributed protocols, and that performance wins can be found fairly easily by choosing the rules to apply manually. A natural next step is to develop cost models for our context and integrate them into a search algorithm, in order to create an automatic optimizer for distributed systems. Standard techniques may be useful here, but we also expect new challenges in modeling dynamic load and contention. It seems likely that adaptive query optimization and learning could prove relevant here to enable autoscaling [20,58].

Comparison to ad hoc rewrites
Our previous results show apples-to-apples comparisons between naive Dedalus implementations and Dedalus implementations optimized with rule-driven rewrites. However, they do not quantify the difference between Dedalus implementations optimized with rule-driven rewrites and ad hoc optimized protocols written in a more traditional procedural language. To this effect, we compare our scalable version of Paxos to Compartmentalized Paxos [63]. We do this for two reasons: (1) Paxos is notoriously hard to scale manually, and (2) Compartmentalized Paxos is a state-of-the-art implementation of Paxos based, among other optimizations, on manually applied decoupling and partitioning. To best understand the merits of scalability, we choose not to batch client requests, as batching often obscures the benefits of individual scalability rewrites. Whittaker's BasePaxos was reported to peak at 25,000 commands/s with f = 1 and 3 replicas on AWS in 2021 [63]. As seen in Figure 9, we verified this result on GCP using the same code and experimental setup. Our Dedalus implementation of Paxos (DedalusBasePaxos), in contrast, peaks at a higher 50,000 commands/s with the same configuration as BasePaxos. We suspect this performance difference is due to the underlying implementations: BasePaxos is written in Scala, while DedalusBasePaxos is compiled to Hydroflow atop Rust. Indeed, our deployment of CompPaxos peaked at 130,000 commands/s, and our reimplementation of Compartmentalized Paxos in Dedalus (DedalusCompPaxos) peaked at a higher 160,000 commands/s, a throughput gap comparable to the 25,000 commands/s gap between DedalusBasePaxos and BasePaxos.
Note that, technically, CompPaxos was reported to peak at 150,000 commands/s, not 130,000. We deployed the Scala code provided by Whittaker et al. with identical hardware, network, and configuration, but could not replicate their exact result.
We now have enough context to compare the throughput of CompPaxos and ScalablePaxos; their respective architectures are shown in Figure 8. CompPaxos achieves maximum throughput with 20 machines: 2 proposers, 10 proxy leaders, 4 acceptors (in a 2 × 2 grid), and 4 replicas. We compare CompPaxos and ScalablePaxos using the same number of machines, fixing the number of proposers (for fault tolerance) and replicas (which we do not decouple or partition). Restricted to 20 machines, ScalablePaxos achieves its maximum throughput with 2 proposers, 2 p2a proxy leaders, 3 coordinators, 3 acceptors, 6 p2b proxy leaders, and 4 replicas. All components are kept at the minimum configuration, with only 1 partition, except for the p2b proxy leaders, which are the throughput bottleneck. ScalablePaxos then scales to 130,000 commands/s, a 2.5× throughput improvement over DedalusBasePaxos. Although CompPaxos reports a 6× throughput improvement over BasePaxos (from 25,000 to 150,000 commands/s) in Scala, reimplemented in Dedalus the improvement between DedalusCompPaxos and DedalusBasePaxos is 3×, similar to the 2.5× improvement between ScalablePaxos and DedalusBasePaxos. Therefore we conclude that the throughput improvements of rule-driven rewrites and ad hoc rewrites are comparable when applied to Paxos. We emphasize that our framework cannot realize every ad hoc rewrite in CompPaxos (Figure 8). We describe the differences between CompPaxos and ScalablePaxos next.

Proxy leaders.
Figure 8 shows that CompPaxos has a single component called "proxy leader" that serves the roles of two components in ScalablePaxos: p2a and p2b proxy leaders. Unlike p2a and p2b proxy leaders, proxy leaders in CompPaxos can be shared across proposers. Since only 1 proposer will be the leader at any time, CompPaxos ensures that work is evenly distributed across proxy leaders. Our rewrites focus on scaling out and do not consider sharing physical resources between logical components. Moreover, there is an additional optimization in the proxy leader of CompPaxos. CompPaxos avoids relaying p2bs from proxy leaders to proposers by introducing nack messages from acceptors that are sent instead. This optimization is neither decoupling nor partitioning and hence is not included in ScalablePaxos.

Acceptors.
CompPaxos partitions acceptors without introducing coordination, allowing each partition to hold an independent ballot. In contrast, ScalablePaxos can only partially partition acceptors and must introduce coordinators to synchronize ballots between partitions, because our formalism states that the partitions' ballots together must correspond to the original acceptor's ballot. Crucially, CompPaxos allows the highest ballot held at each partition to diverge while ScalablePaxos does not, because this divergence can introduce non-linearizable executions that remain safe for Paxos but are too specific to generalize. We elaborate more on this execution in Appendix C. Despite its additional overhead, ScalablePaxos does not suffer from increased latency, because the overhead is not on the critical path. Assuming a stable leader, p2b proxy leaders do not need to forward p2bs to proposers, and acceptors do not need to coordinate between partitions. CompPaxos also relies on additional optimizations [47] and flexible quorums [36], which are outside the scope of this paper as they are not instances of decoupling or partitioning. These optimizations, combined with the more efficient use of proxy leaders, explain the remaining throughput difference between CompPaxos and ScalablePaxos.

On the Benefit of Individual Rewrites
In Figure 10, we examine each rewrite's scaling potential. To create a consistent throughput bottleneck, we introduce extra computation via multiple AES encryptions. When decoupling, the program must always decrypt the message from the client and encrypt its output. When partitioning, the program must always encrypt its output. When decoupling, we always separate one node into two. When partitioning, we always create two partitions out of one. Thus the maximum scale factor of each rewrite is 2×. To determine the scaling factors, we increased the number of clients by increments of two for decoupling and three for partitioning, stopping when we reached saturation for each protocol.
Briefly, we study each of the individual rewrites using the following artificial protocols: • Mutually Independent Decoupling: A replicated set where the leader decrypts a client request, broadcasts payloads to replicas, collects acknowledgements, and replies to the client (encrypting the response), similar to the voting protocol. We denote this base protocol as R-set. We decouple the broadcast and collection rules.
• Monotonic Decoupling: An R-set where the leader also keeps track of a ballot that is potentially updated by each client message. The leader attaches the value of the ballot at the time each client request is received to the matching response.
• Functional Decoupling: The same R-set protocol, but with zero replicas. The leader attaches the highest ballot it has seen so far to each response. It still decrypts client requests and encrypts replies as before.
• Partitioning With Co-Hashing: An R-set.
• Partitioning With Dependencies: An R-set where each replica records the number of hash collisions, similar to our running example.
• Partial Partitioning: An R-set where the leader and replicas each track an integer. The leader's integer is periodically incremented and sent to the replicas, similar to Paxos. The replicas attach their latest integers to each response.
The impact on throughput varies between rewrites due to both the overhead introduced and the underlying protocol. Note that of our 6 experiments, monotonic and functional decoupling are the only ones that add a network hop to the critical path of the protocol and rely on pipelined parallelism. The combination of networking overhead and the potential for imperfect pipelined parallelism likely explains why they achieve only about a 1.7× performance improvement. In contrast, the speedups for mutually independent decoupling and the different variants of partitioning are closer to the expected 2×. Nevertheless, each rewrite improves throughput in isolation, as shown in Figure 10.

RELATED WORK
Our results build on rich traditions in distributed protocol design and parallel query processing. The intent of this paper was not to innovate in either of those domains per se, but rather to take parallel query processing ideas and use them to discover and evaluate rewrites for distributed protocols.

Manual Protocol Optimizations
There are many clever, manually-optimized variants of distributed protocols that scale by avoiding coordination, e.g., [3,12,23,39,49,63]. These works rely on intricate modifications to underlying protocols like consensus, with manual (and not infrequently buggy [53]) end-to-end proofs of correctness for the optimized protocol. In contrast, this paper introduces a rule-driven approach to optimization that is correct by construction, with proofs narrowly focused on small rewrites.
We view our work here as orthogonal to most ad hoc optimizations of protocols. Our rewrites are general and can be applied correctly to the results of ad hoc optimization. In future work it would be interesting to see when and how the more esoteric protocols cited above might benefit from further optimization using the techniques in this paper.
Our work was initially inspired by the manually-derived Compartmentalized Paxos [63], from which we borrowed our focus on decoupling and partitioning. Our work does not achieve all the optimizations of Compartmentalized Paxos (Section 5.3), but it achieves the most important ones, and our results are comparable in performance.
There is a long-standing research tradition of identifying commonalities between distributed protocols that provide the same abstraction [9,11,29,30,37,60,61,64,65]. In principle, optimizations that apply to one protocol can be transferred to another, but this requires careful scrutiny to determine if the protocols fit within some common framework. We attack this problem by borrowing from the field of programming languages. The language Dedalus is our "framework"; any distributed protocol expressed in Dedalus can benefit from our rewrites via a mechanical application of the rules. Although our general rewrites cannot cover every possible optimization a programmer can envision, they can be applied effectively.

Parallel Query Processing and Dataflow
A key intuition of our work is to rewrite protocols using techniques from distributed ("shared-nothing") parallel databases. The core ideas go back to systems like Gamma [22] and GRACE [25] in the 1980s, for both long-running "data warehouse" queries and transaction processing workloads [21]. Our work on partitioning (Section 4) adapts ideas from parallel SQL optimizers, notably work on auto-partitioning with functional dependencies, e.g., [70]. Traditional SQL research focuses on a single query at a time. To our knowledge the literature does not include the kind of decoupling we introduce in Section 3.
Big Data systems (e.g., [19,38,68]) extended the parallel query literature by adding coordination barriers and other mechanisms for mid-job fault tolerance. By contrast, our focus here is on modest amounts of data with very tight latency constraints. Moreover, fault tolerance is typically implicit in the protocols we target. As such we look for coordination-freeness wherever we can, and avoid introducing the additional overheads common in Big Data systems.
There is a small body of work on parallel stream query optimization. An annotated bibliography appears in [34]. Widely-deployed systems like Apache Flink [16] and Spark Streaming [69] offer minimal insight into query optimization.
Parallel Datalog goes back to the early 1990s (e.g., [26]). A recent survey covers the state of the art in modern Datalog engines [41], including dedicated parallel Datalog systems and Datalog implementations over Big Data engines. The partitioning strategies we use in Section 4 are discussed in the survey; a deeper treatment can be found in the literature cited in Section 4 [8,27,28,55].

DSLs for Distributed Systems
We chose the Dedalus temporal logic language because it was both amenable to our optimization goals and we knew we could compile it to high-performance machine code via Hydroflow. Temporal logics have also been used for verification of protocols, most notably Lamport's TLA+ language [44], which has been adopted in applied settings [50]. TLA+ did not suit our needs for a number of reasons. Most notably, efficient code generation is not a goal of the TLA+ toolchain. Second, an optimizer needs lightweight checks for properties (FDs, monotonicity) in the inner loop of optimization; TLA+ is ill-suited to that use case. Finally, TLA+'s standard tooling is a finite model checker: it provides evidence of correctness (up to k steps of execution) but no proofs. There are efforts to build symbolic checkers for TLA+ [42], but again these do not seem well-suited to our lightweight setting.
Declarative languages like Dedalus have been used extensively in networking. Loo et al. surveyed work as of 2009, including the Datalog variants NDlog and Overlog [45]. As networking DSLs, these languages take a relaxed "soft state" view of topics like persistence and consistency. Dedalus and Bloom [6,18] were developed with the express goal of formally addressing persistence and consistency in ways that we rely upon here. More recent languages for software-defined networks (SDNs) include NetKAT [10] and P4 [15], but these focus on centralized SDN controllers, not distributed systems.
Further afield, DAG-based dataflow programming is explored in parallel computing (e.g., [13,14]). While that work is not directly relevant to the transformations we study here, its techniques for scheduling DAGs in parallel environments may inform future work.

CONCLUSION
This is the first paper to present general scaling optimizations that can be safely applied to any distributed protocol, taking inspiration from traditional SQL query optimizers. It opens the door to the creation of automatic optimizers for distributed protocols.
Our work builds on the ideas of Compartmentalized Paxos [63], which "unpacks" atomic components to increase throughput. Beyond our work on generalizing decoupling and partitioning via automation, there are interesting follow-on questions that we have not addressed here.
The first challenge follows from the separation of an atomic component into multiple smaller components: when one of the smaller components fails, others may continue responding to client requests. While this is not a concern for protocols that assume omission failures, additional checks and/or rewriting may be necessary to extend our work to weaker failure models. The second challenge is the potential for liveness issues introduced by the additional latency of our rewrites and our assumption of an asynchronous network. Protocols that calibrate timeouts assuming a partially synchronous network with some maximum message delay may need their timeouts recalibrated. This can likely be addressed in practice using typical pragmatic calibration techniques.

A DECOUPLING
We will require the following terms in addition to the terms introduced in Section 2.
An instance $I$ over program $P$ is a set of facts for relations in $P$. An immediate consequence operator evaluates rules to produce new facts from known facts: $T_\varphi(I)$ over instance $I$ and rule $\varphi$ is the set of facts $f_{head}$ such that $f_{head} :- f_1, \ldots, f_n$ is an instantiation of $\varphi$, where each $f_i$ is in $I$. For the remainder of this paper, when we refer to an instance, we mean an instance created by evaluating a sequence of immediate consequences over some set of EDB and input facts. An instance is the state of the Dedalus program as a result of repeated rule evaluation.
A relation $r'$ is in the proof tree of relation $r$ if there exist facts $f' \in r'$ and $f \in r$ such that $f'$ is in the proof tree of $f$.
We assume that there is no entanglement [7]: for any fact, values representing location and time appear only in the location and time attributes, respectively. This allows us to modify the locations and times of facts without worrying about changing the values in other attributes. Our transformations may introduce entanglement when necessary.
Whether a quorum is reached is a monotonic condition. Once a quorum is reached for a particular sequence number in Paxos, the committed value cannot change, so the votes for that quorum can be safely forgotten (no longer persisted). To allow monotonic components to forget values, we allow the user to annotate persistence rules with garbage-collection conditions.
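As a hypothetical, Paxos-flavored sketch of such an annotation (the relations vote and committed are illustrative assumptions, not taken from our implementation), the garbage-collection condition appears as a negated guard on the persistence rule:
# Persist votes only while the slot is not yet known to be committed;
# once committed(slot) holds, votes for that slot stop being persisted.
vote(slot,ballot,l,t') :- vote(slot,ballot,l,t), !committed(slot,l,t), t'=t+1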
A.2.2 Mechanism. Monotonic decoupling employs both the Redirection rewrite (Section 3.1) and the Decoupling rewrite (Appendix A.3.1), in addition to the following rewrite to persist inputs to $\mathcal{C}_2$. Rewrite: Monotonic. For all input relations $r'$ of $\mathcal{C}_2$: • Create a relation $r''$ with all the attributes of $r'$, and replace all references to $r'$ in $\mathcal{C}_2$ with $r''$.
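A minimal sketch of the rules this rewrite might add, assuming the input relation $r'$ has a single data attribute a (hypothetical schema, written here as rP and rPP for $r'$ and $r''$):
# Copy each arriving input into the persisted relation.
rPP(a,l,t) :- rP(a,l,t)
# Persist the copy, so C2's monotonic logic can re-derive its results
# regardless of the original arrival order or timing of inputs.
rPP(a,l,t') :- rPP(a,l,t), t'=t+1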

A.3 Functional decoupling
A.3.1 Mechanism. Functional decoupling employs the Redirection rewrite (Section 3.1) to route data from other components $\mathcal{C}'$ to $\mathcal{C}_2$. Routing data from $\mathcal{C}_1$ to $\mathcal{C}_2$ requires the explicit introduction of asynchrony below.
Rewrite: Decoupling. Given a rule $\varphi$ in $\mathcal{C}_1$ with a head relation $r$ referenced in $\mathcal{C}_2$: • Create a relation $r'$ with all the attributes of $r$, and replace all references to $r$ in $\mathcal{C}_2$ with $r'$.
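As a minimal sketch (assuming $r$ has a single data attribute a, writing rP for $r'$, and using the forward and delay relations of the asynchronous rules; their exact signatures are assumptions here), the asynchronous rule that feeds $r'$ looks like:
# Forward r from C1's address l to C2's address l2; delay picks a
# non-deterministic arrival time t' > t, modeling the async channel.
rP(a,l2,t') :- r(a,l,t), forward(l,l2), delay((a,l,t),l2,t,t')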
Consider $T_{\varphi'}(I')$ where $\varphi'$ is a rule of $\mathcal{C}_1$. If $\varphi'$ is unchanged from $\varphi$ in $\mathcal{C}$, then the inductive hypothesis implies that the same facts exist in both $I$ and $I'$ for all relations in the body of $\varphi'$, so $T_{\varphi'}(I')$ implies $T_\varphi(I)$.
If $\varphi'$ is a newly asynchronous rule, then let $r'$ be the head relation of $\varphi'$ and $f'$ be the fact of $r'$ in $T_{\varphi'}(I')$.
Let $\varphi$ be the original, synchronous rule. $r'$ must be an input relation of $\mathcal{C}_2$, so we must show that $f'$ is at the head of $T_{\varphi'}(I')$ if and only if $f$ is at the head of $T_\varphi(I)$, where $loc(f) = addr$, $loc(f') = addr2$, and $time(f') > time(f)$. Let $t$ be the time and $l$ the location of all body facts in $T_{\varphi'}(I')$. We know $t$ and $l$ are also the time and location of all body facts in $T_\varphi(I)$ by the inductive hypothesis. $\varphi'$ differs from $\varphi$ only in the two additional relations forward and delay added to the body. forward assigns $f'$ the location addr2, while delay sets the time of $f'$ to some non-deterministic value greater than $t$. If $\varphi$ is asynchronous, then $f'$ and $f$ are output facts. Then $f'$ and $f$ share the same location (the destination), whereas for time the facts only need to satisfy the inequalities $time(f') > time(f'_b)$ and $time(f) > time(f_b)$, for body facts $f'_b$ and $f_b$. In other words, the range of possible values for $time(f')$ is $(time(f'_b), \infty)$ and the range of possible values for $time(f)$ is $(time(f_b), \infty)$. Since $time(f'_b) > time(f_b)$, the range $(time(f'_b), \infty)$ must be a sub-range of $(time(f_b), \infty)$. Therefore, given $f'$ in $T_{\varphi'}(I')$, an immediate consequence $T_\varphi(I)$ with $f$ is always possible where $time(f) = time(f')$ and $f = f'$, completing the proof.

A.4 State machine decoupling
Although this decoupling technique has since been cut from the paper, we still include it, since its preconditions and proofs are referenced in Appendix B.3.
In a state machine component, any pair of facts that are combined (say, via join or aggregation) at time $t$ must be the result of the inputs at time $t$ and the order of inputs prior, but the exact value of $t$ is irrelevant. In these cases, we want to guarantee that (a) facts that co-occur at time $t$ in $\mathcal{C}$ will also be processed together at some time $t'$ in $\mathcal{C}_2$, and (b) the inputs of $\mathcal{C}$ prior to $t$ match the inputs of $\mathcal{C}_2$ prior to $t'$. To meet this guarantee, we collect the facts sent from $\mathcal{C}_1$ to $\mathcal{C}_2$ into sequenced batches.
Precondition: $\mathcal{C}_1$ is independent of $\mathcal{C}_2$, and $\mathcal{C}_2$ behaves like a state machine.
Before we formalize what it means to behave like a state machine, a couple of definitions are helpful.
Definition A.1 (Existence dependency). Relation $r$ has an existence dependency on input relations $r_{in}$ if $r$ is empty whenever there is no input; that is, $r = \emptyset$ in any timestep in which every $r_i \in r_{in}$ is empty. Formally, $\mathcal{C}_2$ is a state machine if (a) all referenced relations have either existence or no-change dependencies on the inputs, and (b) the outputs of $\mathcal{C}_2$ have existence dependencies on the inputs. Condition (a) ensures that the component is insensitive to the passing of time(steps), and (b) ensures that the passing of time(steps) does not affect output content.
A.4.1 Checks for state machine behavior. We provide conservative tests on relations to identify existence and no-change dependencies. A relation $r$ has an existence dependency on input relations $r_{in}$ if for all rules $\varphi$ in the proof tree of $r$: (1) $\varphi$ does not contain t'=t+1, and (2) the body of $\varphi$ contains at least one non-negated relation $r'$ where either $r'$ is an input or $r'$ also has an existence dependency on $r_{in}$.
A relation $r$ has a no-change dependency on input relations $r_{in}$ if: (1) Explicit persist. If there is an inductive rule $\varphi$ with $r = head(\varphi)$, then $\varphi$ must be the persistence rule. Then $r$ is persisted.
(2) Implicit persist. If there is no such inductive rule, then for all (non-inductive) rules $\varphi$ where $r = head(\varphi)$, the body of $\varphi$ contains only EDBs and relations $r'$ where $r'$ has a no-change dependency on $r_{in}$.
(3) Change only on inputs. If there is such an inductive rule, then we also allow rules $\varphi$ where $r = head(\varphi)$ whose bodies contain at least one non-negated relation $r'$, where either $r' \in r_{in}$ or $r'$ has an existence dependency on $r_{in}$.
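To illustrate these checks, consider a hypothetical component with input relation in (all relation names here are assumed for the example):
# pending passes the existence-dependency check: no t'=t+1 rule, and its
# body references the input, so pending is empty whenever in is empty.
pending(x,l,t) :- in(x,l,t)
# log passes the no-change checks: it has an explicit persistence rule,
# and its only other rule can fire only when the input is non-empty.
log(x,l,t) :- in(x,l,t)
log(x,l,t') :- log(x,l,t), t'=t+1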
A.4.2 Mechanism. To guarantee the co-occurrence of facts, rewrites for state machine decoupling must preserve the order and batching of inputs. As in the rewrites above, we create new relations and asynchronous forwarding rules for relations in $\mathcal{C}_1$ referenced in $\mathcal{C}_2$. To preserve the ordering and batching of inputs, we create additional rules in $\mathcal{C}_1$ to track the number of previous batches and the current batch size (the number of output facts to $\mathcal{C}_2$). Then $\mathcal{C}_2$ ensures that all previous batches have been processed and the current batch has fully arrived before processing any input fact in the current batch.
Unlike the decoupling techniques described so far, we cannot reroute rules from other components $\mathcal{C}'$ to $\mathcal{C}_2$; those input facts must be batched and ordered by $\mathcal{C}_1$. Intuitively, if all inputs are routed through $\mathcal{C}_1$, then the batching and ordering on $\mathcal{C}_1$ is a feasible batching and ordering on $\mathcal{C}$. $\mathcal{C}_2$ must then process its inputs with the same batching and ordering to guarantee correctness. Were a fact $f$ to arrive at $\mathcal{C}_2$ without batching or ordering information from $\mathcal{C}_1$, then $f$ may happen-after some input $f'$ in $\mathcal{C}_1$ but be processed before $f'$ is processed at $\mathcal{C}_2$, violating causality.
Our rewrites append a new attribute $t_1$ to relations forwarded to $\mathcal{C}_2$ from $\mathcal{C}_1$. This attribute represents the time on $\mathcal{C}_1$ at which each input fact existed, allowing $\mathcal{C}_2$ to process facts in the same order and batches as $\mathcal{C}_1$ even with non-deterministic message delay.
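A minimal sketch of such a forwarding rule, assuming $r$ has a single data attribute a, r'' is the copy of $r$ on $\mathcal{C}_2$, and delay's signature is an assumption: the fact's $\mathcal{C}_1$ timestamp t rides along as the extra attribute.
# Forward r to C2, carrying the C1 time t as the batching attribute t1;
# the arrival time t' is chosen non-deterministically by delay.
r''(a,t,l2,t') :- r(a,l,t), forward(l,l2), delay((a,l,t),l2,t,t')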
sealed(t1,l,t') :- canSeal(t1,l,t), t'=t+1
sealed(t1,l,t') :- sealed(t1,l,t), t'=t+1
# Can process facts at time t1.
rSealed(...,l,t) :- r''(...,t1,l,t), sealed(t1,l,t)
Note that whenever a time $t_1$ is sealed on $\mathcal{C}_2$, the facts in $r''$ and inputs can be garbage collected. Facts in sealed can be garbage collected once a higher timestamp has been sealed. We omit these optimizations for simplicity. A.4.3 Proof. Our proof relies on $\mathcal{C}_2$ processing inputs in the same order and batches as $\mathcal{C}_1$. For simplicity, we denote $I_{R,t}$ as the set of facts $f$ in $I$ where $f$ is a fact of a relation $r$ in $R$ and $time(f) = t$.
Formally, we will prove that for each instance $I'$ there exists $I$ such that (1) for relations $r$ referenced in $\mathcal{C}_1$ and the output relations of $\mathcal{C}_1$ and $\mathcal{C}_2$ (excluding the input relations of $\mathcal{C}_2$), $I$ contains fact $f$ in $r$ if and only if $I'$ contains $f$, and (2) for the set of relations $R'$ referenced in $\mathcal{C}_2$ (and $R$ corresponding to those relations with rSealed replaced by $r$), for any time $t'$ where at least one input relation rSealed of $\mathcal{C}_2$ is not empty in $I'$ and contains fact $f'_{in}$, let $t = t_1(f'_{in})$. We must have $I'_{R',t'} = I_{R,t}$ when facts in rSealed are mapped to $r$, and location and time are ignored.
For rules $\varphi'$ of $\mathcal{C}_1$, the inductive proof is identical to previous proofs. Now consider $T_{\varphi'}(I')$ where $\varphi'$ is a rule of $\mathcal{C}_2$ and $\varphi'$ corresponds to $\varphi$ in $\mathcal{C}$. Let $t'$ be the time of the immediate consequence $T_{\varphi'}(I')$ such that for all facts $f'$ of relations in the body of $\varphi'$, $time(f') = t'$. If all input relations (rSealed, not $r$) are empty at $t'$ in $I'$, then $\varphi'$ cannot produce output facts at $t'$, since output relations must have existence dependencies on the input relations.
If at least one input relation rSealed is not empty at $t'$ in $I'$, we must show $I'_{R',t'} = I_{R,t}$. Let $t = t_1(f'_{in})$ for some fact $f'_{in}$ in rSealed with $time(f'_{in}) = t'$. Let $t'_<$ be the time of the previous input in $I'$; formally, for all input relations rSealed of $\mathcal{C}_2$, there is no $t''$ where $t'_< < t'' < t'$ such that $I'_{rSealed,t''}$ is non-empty. Similar to how we construct $t$ from $t'$, let $t_< = t_1(f'_<)$ for some fact $f'_<$ in rSealed with $time(f'_<) = t'_<$. We first show that $t_<$ is the time of the previous input in $I$; formally, for the input relations rSealed of $\mathcal{C}_2$ and the corresponding relations $r$, there is no $t''$ where $t_< < t'' < t$ such that $I_{r,t''}$ is non-empty.
We prove by contradiction, assuming such a $t''$ exists. In order for a fact $f'_{in}$ in rSealed in $I'$ to have $time(f'_{in}) = t'$, canSeal must contain the fact canSeal($t$, addr2, $t'$), which is only possible if sealed contains the fact sealed($t''$, addr2, $t'$) and inputs contains the fact inputs(…, $t''$, addr2, $t'$). sealed($t''$, addr2, $t'$) implies the fact canSeal($t''$, addr2, $t'_<$) where $t'_< = time(f'_<)$. In order for some fact $f$ in $I'$ to join with canSeal($t''$, addr2, $t'_<$) to create $f'_<$, we must have $t_1(f) = t'' = t_1(f'_<)$. By definition, $t_1(f'_<) = t_<$, so $t_< = t''$, contradicting $t_< < t''$. Knowing that no input facts exist with times between $t_<$ and $t$ in $I$ and between $t'_<$ and $t'$ in $I'$, we can use the existence and no-change dependencies of relations in $\mathcal{C}$ to reason about the instances $I'_{R',t'}$ and $I_{R,t}$. By the induction hypothesis, we have $I'_{R',t'_<} = I_{R,t_<}$. We can now reason about the instances $I'$ and $I$ at times $t'-1$ and $t-1$, respectively. For all relations $r$ with existence dependencies on the inputs, $I'_{r,t'-1}$ and $I_{r,t-1}$ must both be empty. For all relations $r$ with no-change dependencies on the inputs, the contents at $t'-1$ and $t-1$ are unchanged from $t'_<$ and $t_<$, so $I'_{r,t'-1} = I_{r,t-1}$. Consider an immediate consequence $T_{\varphi'}(I')$ with facts $f'$ in the body of $\varphi'$ where $time(f') = t'$. We find the set of facts in the proof tree of $f'$ that have no proof tree (because they are inputs or EDBs) or have parents in the tree with time $t'-1$. Inputs and EDBs are the same between $I'$ and $I$ at times $t'$ and $t$, due to our sealing mechanism. Since we know $I'_{R',t'-1} = I_{R,t-1}$, any parent fact in the tree with time $t'-1$ exists in $I$ with time $t-1$ and evaluates to the same fact. Therefore, the same series of immediate consequences is possible in $I$ to produce each fact in $I'_{R',t'}$. Thus $I'_{R',t'} = I_{R,t}$. If $\varphi'$ is an asynchronous rule, then the head of $\varphi'$ is an output relation, and we have to show that the fact $f'$ in $I'$ of immediate consequence $T_{\varphi'}(I')$ is equivalent to some fact $f$ in $I$ of immediate consequence $T_\varphi(I)$. Output relations must have existence dependencies in $\mathcal{C}_2$ by assumption, so given the body facts of $T_{\varphi'}(I')$ with time $t'$, the input relations are not empty at $t'$ in $I'$, and $I'_{R',t'} = I_{R,t}$. The body facts $f'_b$ of $\varphi'$ for $f'$ must correspond to body facts $f_b$ of $\varphi$ for $f$, with $loc(f_b) = addr$ and $time(f_b) = t$ instead of $loc(f'_b) = addr2$ and $time(f'_b) = t'$. The facts $f'$ and $f$ at the heads of $\varphi'$ and $\varphi$ must be the same as well, with times non-deterministic under the constraints $time(f') > t'$ and $time(f) > t$. We know $t' > t$ due to the addition of an asynchronous channel; therefore $time(f) = time(f')$ is always possible, and $f = f'$, completing the proof.

A.5 Asymmetric decoupling
In this section, we consider decoupling where $\mathcal{C}_1$ and $\mathcal{C}_2$ are mutually dependent, as well as the case where $\mathcal{C}_2$ is independent of $\mathcal{C}_1$ instead.
If $\mathcal{C}_1$ and $\mathcal{C}_2$ are both monotonic, then they can be decoupled through the Redirection With Persistence rewrite (Section 3.2), according to the CALM Theorem [32]. Now consider $\mathcal{C}_2$ independent of $\mathcal{C}_1$, where $\mathcal{C}_2$ exhibits useful properties for decoupling. Intuitively, although $\mathcal{C}_2$ forwards facts to $\mathcal{C}_1$, we can treat the time at which $\mathcal{C}_1$ processes inputs as the "time of input arrival" while allowing $\mathcal{C}_2$ to process inputs first. This presents a problem: $\mathcal{C}_2$ might produce outputs "too early", violating well-formedness. To preserve well-formedness, we introduce a rewrite to delay any output fact $f'$ derived from an input fact $f$ of $\mathcal{C}_2$ until $\mathcal{C}_1$ acknowledges it has processed $f$.
Precondition: $\mathcal{C}_2$ is independent of $\mathcal{C}_1$, and $\mathcal{C}_2$ is (1) a state machine and (2) either monotonic or functional.
A.5.1 Mechanism. To decouple, first apply the Batching rewrite (Appendix A.4.2) for all rules in $\mathcal{C}_2$ whose head $r$ is referenced in $\mathcal{C}_1$, replacing $r$ with rSealed. Populate forward with both forward(addr,addr2) and forward(addr2,addr). Then apply the forwarding rewrite for all rules in another component $\mathcal{C}'$ whose head is referenced in $\mathcal{C}_2$. Finally, perform the following rewrite to delay the outputs of $\mathcal{C}_2$. Rewrite: Batch Acknowledgement. Add the following rules to track which batches $\mathcal{C}_1$ has processed:
batchACK(t2,l',t') :- canSeal(t2,l,t), forward(l,l'), delay(...)
The history of an instance $I$ is constructed using the wall-clock times of the inputs and outputs of its observably equivalent instance $\mathcal{I}$. Therefore, any instances $I_1$ and $I_2$ that are observably equivalent to the same $\mathcal{I}$ share the same histories. We will prove that for each instance $I'$ there exists $I$ and an observable instance $\mathcal{I}$ such that both are observably equivalent to $\mathcal{I}$. Since $\mathcal{C}_2$ behaves like a state machine, the facts of each relation referenced in $\mathcal{C}_2$ depend on the input facts of $\mathcal{C}_2$ and their ordering. Since $\mathcal{C}_2$ is either functional or monotonic, these facts do not depend on the ordering of the input facts; this can be proven by treating each such relation as an output relation, then reapplying the proofs from Appendices A.2.3 and A.3.2. Let $R$ be the input relations of $\mathcal{C}_2$. We can prove that a fact $f'$ of such a relation exists in $I'$ if and only if the corresponding $f$ exists in $I$, where $f'$ equals $f$ except $loc(f') = addr2$ while $loc(f) = addr$, and if $time(f') = t'$ then $time(f) = t$, while if $time(f') < t'$ then $time(f) < t$. In other words, $I$ and $I'$ share the same inputs at $t'$ and $t$ and all previous inputs, but previous inputs may arrive out of order.
We prove by contradiction. First consider some $f'_{in}$ of an input relation of $\mathcal{C}_2$ in $I'$ where $time(f'_{in}) = t'$, but there is no $f_{in}$ in $I$ where $time(f_{in}) = t$. Since the relation is an output of $\mathcal{C}_2$ and an input of $\mathcal{C}_1$, it is batched according to the batching rewrite, and there must be a fact $f_s$ in canSeal of $I'$ with $t_2(f_s) = t'$ signalling when the facts can be processed. By construction of $I$, $f'_{in}$ exists in $I'$ if and only if $f_{in}$ exists in $I$, a contradiction. Now consider some $f'_{in}$ where $time(f'_{in}) < t'$, but there is no $f_{in}$ where $time(f_{in}) < t$. Either there is some $f_s$ as above, or $time(f_{in}) = time(f'_{in})$ by construction of $I$. Since $time(f'_{in}) < t' < t$, we have $time(f_{in}) = time(f'_{in}) < t$, completing the proof. Therefore, for the inputs and outputs of $\mathcal{C}_1$, $I'$ and $I$ are observably equivalent to $\mathcal{I}$.
We now prove that $I'$ and $I$ are observably equivalent to $\mathcal{I}$ for the inputs and outputs of $\mathcal{C}_2$. By construction of $I$, the inputs of $I$ are observably equivalent to $\mathcal{I}$. Since $\mathcal{C}_2$ is either functional or monotonic, and $\mathcal{C}_2$ is independent of $\mathcal{C}_1$, given the same input facts of $\mathcal{C}_2$ in $I'$ and $I$, in any relation $r$ referenced in $\mathcal{C}_2$, $f'$ exists in $r$ of $I'$ if and only if $f$ exists in $I$, where $f'$ equals $f$ except on time. Consider an output fact $f'$ derived at time $t'$ in $I'$ and $f$ in $I$. In $I'$, $f'$ is buffered in outP until the latest input fact of $\mathcal{C}_2$ is acknowledged by $\mathcal{C}_1$. Since we assumed that $\mathcal{C}_2$ is a state machine, all outputs of $\mathcal{C}_2$ must have existence dependencies on its inputs, and the latest input fact $f'_{in}$ of $\mathcal{C}_2$ in $I'$ must have time $t'$. There must be a fact $f_s$ in canSeal of $I'$ with $t_2(f_s) = t'$. By construction of $I$, $t = time(f_s)$. The acknowledgement for $f_s$ must be sent at $t$ in $I'$ and arrive at some time $t_a > t$ at $\mathcal{C}_2$, at which point $f'$ can be sent. Therefore, the range of possible times of $f$ is $(t, \infty)$ and that of $f'$ is $(t_a, \infty)$, where the range for $f'$ is a sub-range of that for $f$, and any immediate consequence of $\varphi'$ in $I'$ must be possible in $I$. Note that we do not separately prove correctness for aggregation and negation because they are covered by our proofs according to the definition of "sharing keys" in Section 4.1.

B.2 Partitioning with dependencies
B.2.1 Checks for dependencies. FDs in Dedalus can be created in three ways: • EDB annotation. For example, hash(M,H) is the EDB relation that simulates the hash function $hash(m) = h$, so there is an FD $f: M \to H$ where $f(m) = hash(m)$.
• Variable sharing. For a relation $r$, if in all rules $\varphi$ with $r$ as the head, attributes $a$ and $b$ of $r$ always share keys, then there is an FD from $a$ to $b$ and from $b$ to $a$, where $f(x) = x$.
• Inheritance. The heads of rules $\varphi$ can inherit functional dependencies from a combination of the joined relations in the body of $\varphi$.
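For example, under the variable-sharing check, a hypothetical rule that binds two head attributes to the same variable induces identity FDs in both directions:
# Both attributes of dup are bound to x, so each attribute functionally
# determines the other, with f the identity function.
dup(x,x,l,t) :- src(x,l,t)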
In the last case, to determine which dependencies are inherited, we perform the following analysis for each relation $r$ in the head of rule $\varphi$: • Attribute-variable substitution. Take the set of all FDs of all relations in the body of $\varphi$ and replace each domain/co-domain on attributes with their bound variables in $\varphi$.
• Constant substitution. If an attribute is bound to a constant instead of a variable, plug the constant into the FDs of that attribute. Now all FDs in $\varphi$ should be functions on variables.
• Transitive closure. Construct the transitive closure of all such FDs (see the sketch after this list).
• Variable-attribute substitution. Replace each FD on variables with their bound attributes from $r$, where possible. The FDs that only contain attributes in $r$ are the possible FDs of $r$.
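Because candidate FDs are themselves just facts about attributes, the transitive-closure step can be sketched in the same Datalog style as the rest of the paper; fd and fdClosure are hypothetical optimizer-side relations (no location or time attributes), where fd(a,b) records a candidate FD a → b of a single rule:
# Close candidate FDs under composition: if a -> b and b -> c hold,
# then a -> c holds as well.
fdClosure(a,b) :- fd(a,b)
fdClosure(a,c) :- fdClosure(a,b), fd(b,c)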
Having described the process of extracting FDs for each relation $r$ at the head of each rule $\varphi$, we must determine which FDs hold across rules. Since the identification of FDs for any relation assumes that all of the relations it depends on have already been analyzed, and Dedalus allows dependency cycles, FD analysis must be recursive. We divide the process of identifying the FDs of $r$ into two steps: union and intersection. The union step recursively takes the union of the dependencies generated by any rule $\varphi$ where $r$ is the head; the intersection step recursively removes FDs that are not generated in some rule $\varphi$ where $r$ is the head.
CDs can be similarly extracted using dependency analysis. • If $r$ is referenced in $\mathcal{C}_1$, add the body term proxy($l$,$l'$) to $\varphi$ and bind the location attribute of the head to $l'$.
• Otherwise, apply the partitioning rewrite.
We describe the functionality of the proxy node but omit its implementation. The proxy node acts as the coordinator in 2PC, where the partitioned nodes are the participants. It receives input facts on behalf of $\mathcal{C}_1$ as described above, assigns each fact a unique, incrementing order, and broadcasts them to each node through rVoteReq. The nodes freeze and reply through rVote. The proxy waits to hear from all the nodes, then broadcasts the message through rCommit, which unfreezes the nodes. We describe below how the nodes are modified to freeze, vote, and unfreeze (only after receiving all previously voted-for values). Add the following rules to $\mathcal{C}$:
processedI(i,l,t') :- processedI(i,l,t), t'=t+1
maxProcessedI(max<i>,l,t) :- processedI(i,l,t)
maxReceivedI(max<i>,l,t) :- receivedI(i,l,t)
unfreeze(l,t) :- maxReceivedI(i,l,t), maxProcessedI(i,l,t), !outstandingVote(l,t)
As we show below, unfreeze(l,t) will be appended to rules so they can only be executed when all previous replicated inputs have been processed.
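As a minimal sketch of that final step (rIn and out are hypothetical relations), appending unfreeze to a rule of $\mathcal{C}$ guards its execution:
# The guarded rule fires only when the node is unfrozen, i.e., all
# previously replicated inputs are processed and no vote is outstanding.
out(x,l,t) :- rIn(x,l,t), unfreeze(l,t)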
For each relation $r$ referenced in $\mathcal{C}_1$, replace $r$ with rSealed and add the accompanying sealing rules. Since the input relations $r$ to $\mathcal{C}_1$ now arrive at the proxy first before being forwarded to the partitioned nodes, the arrival time of input facts in the transformed component $\mathcal{C}$ no longer corresponds to the processing time. This poses a problem for our proof; in the original component $\mathcal{C}$, the arrival time of input facts is the processing time. We relax this requirement through the observation that because messages are sent over an asynchronous network, any entity that sends and receives messages from $\mathcal{C}$ can only observe the "send" time $t_s$ of each input fact $f$, where $t_s < time(f)$. Intuitively, an input fact that is sent at time $t_s$ and in-network for $d$ seconds is processed identically to a fact sent at time $t_s$, in-network for $d'$ seconds, and buffered for $b$ seconds, so long as $d' + b = d$.
We will prove that for each instance $I'$ over the partially partitioned component $\mathcal{C}$, there exists $I$ over the original component $\mathcal{C}$ and an observable instance $\mathcal{I}$ such that (1) $I$ and $I'$ are both observably equivalent to $\mathcal{I}$, and (2) for the set of relations $R$ referenced in $\mathcal{C}$ (and $R'$ corresponding to $R$ with rSealed replacing $r$), for any time $t'$ and node $n$ where at least one input relation rSealed of $\mathcal{C}_1$ is not empty in $I'$ and contains fact $f'$, let $f$ be the fact in $I$ corresponding to $f'$ with the smallest time $t = time(f)$, and let $t' = time(f')$. We must have $I'_{R',n,t'} = I_{R,n,t}$. In other words, although each replicated input is delivered at different times at different nodes, the state of each node at its differing time of delivery corresponds to the original state at a single time of input delivery.
(3) Similarly, if at least one input relation rSealed of $\mathcal{C}_2$ is not empty, then the same condition holds with $t' = t$ and $f$ corresponding to $f'$.
First, observe that claim 1 holds if (a) each output fact $f$ of $\mathcal{C}$ exists in $I'$ if and only if $f$ is also in $I$, and (b) claims 2 and 3 hold. Let $\mathcal{I}$ be observably equivalent to $I$; each input fact $f$ of $\mathcal{C}$ in $I$ exists if and only if there exists $f_s$ in $\mathcal{I}$ where $f$ equals $f_s$ except $time(f) > time(f_s)$. By claims 2 and 3, there must be $f'$ in $I'$ where $f'$ equals $f$ except $time(f') \ge time(f)$ (since $time(f)$ is based on the smallest $time(f')$ across nodes), which implies $time(f') > time(f_s)$. If the output facts in $I$ and $I'$ are identical, then $I'$ must also be observably equivalent to $\mathcal{I}$.
Claims 2 and 3 imply that output facts in $I'$ are also possible in $I$. Output relations in $\mathcal{C}$ are assumed to have existence dependencies on inputs, so given the inputs $f'$ and $f$ above, if the range of possible times for any output fact $f'_{out}$ in $I'$ is $(time(f'), \infty)$, then the range of possible times for the same output fact $f_{out}$ in $I$ must be $(time(f), \infty)$. Since $time(f') \ge time(f)$, the range of possible output times of $f'_{out}$ must be a sub-range of that of $f_{out}$, and any $f'_{out}$ in $I'$ is possible in $I$. Therefore it suffices to prove claims 2 and 3.
We select the times of input facts in $I$ such that its inputs are observably equivalent to $\mathcal{I}$. For facts $f'$ in $I'$ over relations rSealed in $\mathcal{C}$: (1) If rSealed is referenced in $\mathcal{C}_1$, let $f'_m$ be the corresponding fact of $f'$ in $I'$ with the smallest time $t' = time(f'_m)$, and let $f$ equal $f'_m$ except $loc(f) = addr$; $f$ of the corresponding relation $r$ exists in $I$ if and only if $f'$ exists in $I'$. (2) If rSealed is referenced in $\mathcal{C}_2$, let $f$ equal $f'$ except $loc(f) = addr$; $f$ of the corresponding $r$ exists in $I$ if and only if $f'$ exists in $I'$. By the partial partitioning rewrite, each fact in rSealed is derived from a series of facts in r, rVoteReq, rVote, and then rCommit. The fact $f_r$ of $r$ in its proof tree must have an earlier time, which must in turn be later than the send time of $f_r$. Therefore, since the times of input facts $f$ in $I$ are set to the times of the facts in rSealed, $time(f)$ must be later than the send time of $f$, and the inputs of $I$ are observably equivalent to $\mathcal{I}$.
We now prove claims 2 and 3 by induction on facts with time $t_i$ in $I'$. We show that claim 2 holds for facts with time $t' = t_i + 1$, assuming claims 2 and 3 hold up to $t_i$. We prove by induction on the inputs in $I'$ and $I$ up to time $t'$. Let $f'$ be a fact in input relation rSealed of $\mathcal{C}_1$ in $I'$ with $t' = time(f')$ and $n = loc(f')$, and let $f'_m$ be the fact in $I'$ corresponding to $f'$ with the smallest time $t = time(f'_m)$. Let $f$ equal $f'_m$ except $loc(f) = addr$. By the definition of $I$ above, $f$ is an input fact in $I$. By the partial partitioning rewrite, there is no other co-occurring input fact $f''$ in $I'$ with $time(f'') = t'$, since processedI will not contain the index of rCommit until the next timestep. Therefore, $I$ does not contain any other input fact at $t$. Let $f'_1$ be the most recent fact in … We show claim 3 holds, using the same variable definitions. By the partial partitioning rewrite, we know that if an input relation rSealed of $\mathcal{C}_2$ is not empty, then the input relations rSealed of $\mathcal{C}_1$ must be empty. Let $f'$ be the input fact in $I'$ at time $t'$ with corresponding $f$ in $I$ at $t$. Note that this proof can be generalized to a set of facts $f'$ at time $t'$. By construction of $I$, $t' = t$. Again assuming the most recent input fact $f'_1$ of … [5] is syntactic sugar we introduce to simulate sending multiple output facts in a single asynchronous message. Sealed relations can be partially partitioned so each partition can compute and send its own fraction of the sealed outputs.

C NON-LINEARIZABLE EXECUTION OVER PARTITIONED ACCEPTORS
CompPaxos partitions acceptors without introducing coordination, allowing each node to hold an independent ballot. In contrast, ScalablePaxos can only partially partition acceptors and must introduce coordinators to synchronize ballots between nodes, because our formalism states that the nodes' ballots together must correspond to the original acceptor's ballot. Proposers in CompPaxos can become the leader after receiving replies from a quorum of any $f+1$ acceptors for each set of partitioned nodes; the nodes across quorums do not need to correspond to the same acceptors. In contrast, the partitioned nodes of each acceptor in ScalablePaxos represent one original acceptor, so proposers in ScalablePaxos become the leader only after receiving replies from all nodes of a quorum of $f+1$ acceptors. Crucially, by allowing the highest ballot held at each node to diverge, CompPaxos admits non-linearizable executions that remain safe for Paxos, but are too specific to generalize.
We first define what it means for a Paxos implementation to be linearizable. A p1a and its corresponding p1b correspond to a request and matching response in the history. For an implementation of Paxos to be linearizable, the content of each p1b must be consistent with its matching p1a taking effect some time between the p1a arrival time and the p1b response time. The same statements hold for p2a and matching p2b messages. Since p1a and p1b messages are now sent to each node in CompPaxos, we must modify the definition of linearizability for CompPaxos accordingly. Assume that a p1a arriving at acceptor $a$ in Paxos corresponds to the arrival of the same p1a messages at all nodes of $a$ in CompPaxos, and a matching p1b arriving at proposer $p$ in Paxos corresponds to the arrival of all matching p1b messages at $p$ in CompPaxos. Now consider the execution shown in Figure 11: (1) Proposer $p_1$ broadcasts p1a with ballot 1. It arrives at all acceptors except partition 1 of acceptor $a_1$. The other acceptors return p1b with ballot 1.
(2) Proposer $p_2$ broadcasts p1a with ballot 2. It arrives at all partitions of every acceptor, which return p1b with ballot 2. Proposer $p_2$ is elected leader.

Fig. 1. Dataflow diagram for a verifiably-replicated KVS. Edges are labeled with corresponding line numbers; dashed edges represent asynchronous channels. Each gray bounding box represents a node; select nodes' dataflows are presented.

Fig. 8. The common path taken by CompPaxos and ScalablePaxos, assuming $f = 1$ and any partitionable component has 2 partitions. The acceptors outlined in red represent possible quorums for leader election.

5.3.1 Throughput comparison. Whittaker et al. created Scala implementations of Paxos (BasePaxos) and Compartmentalized Paxos (CompPaxos). Since our implementations are in Dedalus, we first compare the throughput of the Paxos implementations between the two languages to establish a baseline. Following the nomenclature from Section 5.2, implementations in Dedalus are prepended with "Dedalus", and the Scala implementations by Whittaker et al. are not.

Fig. 10. The scalability gains provided by each rewrite, in isolation.

Definition A.2 (No-change dependency). Relation $r$ has a no-change dependency on input relations $r_{in}$ if $r$'s contents remain unchanged in a timestep when the inputs are empty. That is, if every $r_i \in r_{in}$ is empty at timestep $t$, then $r$ contains exactly the same facts at timestep $t$ as it did at timestep $t-1$.

$I'_{R,n,t'_1} = I_{R,n,t_1}$. Since inputs are the same between $I'$ and $I$ at times $t'$ and $t$, we can similarly use the proofs from Appendices A.4.3 and B.1.2 to prove claim 3, completing the proof of correctness.
B.4 Partitioning sealing
B.4.1 Sealing. Sealing …

Fig. 11. Non-linearizable execution of CompPaxos. Acceptor $a_3$ is excluded for simplicity. Lighter-gray acceptors belong to partition 1, and darker-gray acceptors to partition 2. Each set of requests and matching responses is a different color.
So far, we have only talked about facts that exist at a point in time $t$. State change in Dedalus is modeled through the existence or non-existence of facts across time. Persistence rules like the one below, from Line 2 of Listing 2, ensure inductively that facts in hashset that exist at time $t$ also exist at time $t+1$. Relations with persistence rules, like hashset, are persisted.
hashset(hashed,val,l,t') :- hashset(hashed,val,l,t), t'=t+1
We assume that Dedalus programs are composed of separate components $\mathcal{C}$, each with a nonempty set of rules. In our running example, Listings 1 and 2 define the leader component and the storage component. All the rules of a component are executed together on a single physical node. Many instances of a component may be deployed, each on a different node. The node at location addr only has access to facts $f$ with $loc(f) = addr$, modeling the shared-nothing property of distributed systems. We define a rule's references as the IDB relations in its body; a component references the set of relations referenced by its rules. For example, the storage component in Listing 2 references toStorage, hashset, collisions, and numCollisions. An IDB relation is an input of a component $\mathcal{C}$ if it is referenced in $\mathcal{C}$ and it is not in the head of any rule of $\mathcal{C}$; toStorage is an input to the storage component. A relation that is not referenced in $\mathcal{C}$ but appears in the head of rules in $\mathcal{C}$ is an output of $\mathcal{C}$; fromStorage is an output of the storage component. Note that this formulation explicitly allows a component to have multiple inputs and multiple outputs. Inputs and outputs of the component correspond to asynchronous input and output channels of each node.
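As a minimal, self-contained illustration (hypothetical relations, not from Listings 1 and 2; the delay signature is an assumption), the following component has input in and output outC:
# in appears only in rule bodies, never at a head: an input channel.
mid(x,l,t) :- in(x,l,t)
# outC appears only at a rule head: an output, sent asynchronously.
outC(x,l2,t') :- mid(x,l,t), forward(l,l2), delay((x,l,t),l2,t,t')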

Precondition: $\mathcal{C}_1$ and $\mathcal{C}_2$ are mutually independent. Recall the definition of references from Section 2.4: a component $\mathcal{C}$ references an IDB relation $r$ if some rule $\varphi \in \mathcal{C}$ has $r$ in its body. A component $\mathcal{C}_1$ is independent of component $\mathcal{C}_2$ if (a) the two components reference mutually exclusive sets of relations, and (b) $\mathcal{C}_1$ does not reference the outputs of $\mathcal{C}_2$. Note that this property is asymmetric: $\mathcal{C}_2$ may still be dependent upon $\mathcal{C}_1$ by referencing $\mathcal{C}_1$'s outputs. Hence our precondition requires mutual independence. Because $\mathcal{C}_2$ has changed address, we need to direct facts from any relation $r$ referenced by $\mathcal{C}_2$ to addr2.
Co-hashing alone would not allow partitioning, because $p$ and $q$ do not share keys over their attributes. However, if we know the functional dependency Str → UpStr over $r$, then we can partition $p$, $r$, and $q$ on the uppercase values of the strings and still avoid reshuffling. This co-partition dependency (CD) between the attributes of the relations loosens the co-hashing constraint beyond sharing keys. Formally, relations $r_1$ and $r_2$ have a co-partition dependency $g: A \hookrightarrow B$ on attribute lists $A$, $B$ if for all proof trees containing facts $f_1 \in r_1$, $f_2 \in r_2$, we have $\pi_A(f_1) = g(\pi_B(f_2))$ for some function $g$. If we partition by $A$ (the range of $g$), we also successfully partition by $B$ (the domain of $g$).
The functional dependency (FD) Str → UpStr strengthens partitioning: partitioning on UpStr implies partitioning on Str. Formally, relation $r$ has a functional dependency $f: A \to B$ on attribute lists $A$, $B$ if for all facts $t \in r$, $\pi_B(t) = f(\pi_A(t))$ for some function $f$. That is, the values in the domain of $f$, $A$, determine the values in the range, $B$. This reasoning allows us to satisfy multiple co-hashing constraints simultaneously. Now consider the following joins in the body of a rule: p(str), r(str,upStr), q(upStr).
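Concretely, those joins might appear in a rule like the following (the head relation out is hypothetical); partitioning p, r, and q on the uppercase string then satisfies every co-hashing constraint at once, since the FD Str → UpStr places each p fact on the partition that holds its matching r and q facts:
# All three body relations are partitioned on upStr (and p on the
# uppercase of str), so the join needs no reshuffling at runtime.
out(str,upStr,l,t) :- p(str,l,t), r(str,upStr,l,t), q(upStr,l,t)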
Since the original rule $\varphi$ was synchronous, $f$ shares the same location addr as all other facts in $T_\varphi(I)$, and either time $t$ (if $\varphi$ is deductive) or $t+1$ (if $\varphi$ is inductive). This proves the inductive hypothesis: $f$ and $f'$ share the same values except $loc(f) = addr$ while $loc(f') = addr2$, and $time(f) \le time(f')$. Now consider $T_{\varphi'}(I')$ where $\varphi'$ is a rule of $\mathcal{C}_2$. $\varphi'$ in $\mathcal{C}_2$ is unchanged from $\varphi$ in $\mathcal{C}$. Since we assumed that $\mathcal{C}_2$ is functional, $\varphi'$ contains at most one IDB relation in its body. If there are only EDB relations in its body, then the facts in those EDBs are the same in both $I'$ and $I$, completing the proof. If there is one IDB relation $r_b$ in its body, then by the induction hypothesis, the fact $f'_b$ of $r_b$ in $I'$ implies fact $f_b$ in $I$, where $loc(f_b) = addr$ and $loc(f'_b) = addr2$, $time(f_b) < time(f'_b)$, and $f'_b$ otherwise equals $f_b$. All remaining relations in the body of $\varphi$ must be EDBs with the same facts across all locations and times. Therefore, any immediate consequence $T_{\varphi'}(I')$ with $f'_b$ in its body and $f'$ in its head implies $T_\varphi(I)$ with $f_b$ in its body and $f$ in its head. If $\varphi$ is synchronous, then $f'$ and $f'_b$ share the same location and time, $f$ and $f_b$ share the same location and time, and $f'$ and $f$ are otherwise equal, completing the proof.
For each rule $\varphi$ in $\mathcal{C}_2$ with output relation out, create the relation outP with two additional attributes, $t_2$ and $l'$, representing the derivation time of the fact and the destination, and add the accompanying buffering rules. Since the input relations $r$ to $\mathcal{C}_1$ are now buffered and replaced with rSealed, the arrival time of input facts in the transformed component $\mathcal{C}$ no longer corresponds to the processing time. This poses a problem for our proof; in the original component $\mathcal{C}$, the arrival time of input facts is the processing time. Note that since messages are sent over an asynchronous network, any entity that sends and receives messages from $\mathcal{C}$ can only observe the "send" time $t_s$ of each input fact $f$, where $t_s < time(f)$. Intuitively, an input fact that is sent at time $t_s$ and in-network for $d$ seconds is processed identically to a fact sent at time $t_s$, in-network for $d'$ seconds, and buffered for $b$ seconds, so long as $d' + b = d$. To formalize this intuition, we introduce the observable instance $\mathcal{I}$, which contains a set of facts representing the send times of input facts and the arrival times of output facts. An instance $I$ is observably equivalent to $\mathcal{I}$ if for all facts $f$ in relation $r$ of $\mathcal{C}$: (1) if $r$ is an input relation, then $f$ exists if and only if there exists $f_s$ in $\mathcal{I}$ where $f$ equals $f_s$ except $time(f) > time(f_s)$, and (2) if $r$ is an output relation, then $f$ exists in $I$ if and only if $f$ exists in $\mathcal{I}$.
(1) $I$ and $I'$ are both observably equivalent to $\mathcal{I}$, and (2) each fact $f$ of a relation $r$ referenced by $\mathcal{C}_1$ in $I'$ exists if and only if $f$ exists in $I$. We select the times of input facts in $I$ such that its inputs are observably equivalent to $\mathcal{I}$. For each input relation $r$ in $\mathcal{C}$: (1) If $r$ is an input of $\mathcal{C}_1$, each fact $f$ of $r$ in $I'$ exists if and only if $f$ exists in $I$. (2) If $r$ is an input of $\mathcal{C}_2$, given $f'$ of $r$ in $I'$ with $t' = time(f')$, let the seal time be $t_s = t_2(f_s)$ for the fact $f_s$ of canSeal in $I'$; let $f$ equal $f'$ except $loc(f) = addr$ and $time(f) = time(f_s)$; $f'$ exists if and only if $f$ exists in $I$. (3) If no such $f_s$ exists, then let $time(f) = time(f')$. Note that in case 2, $time(f) > time(f')$ due to the asynchrony from $\mathcal{C}_2$ to $\mathcal{C}_1$. Therefore, input facts $f$ in $I$ correspond to input facts $f'$ in $I'$ where $time(f) \ge time(f')$, thus the inputs of $I$ are observably equivalent to $\mathcal{I}$. We first prove the claim that $I'$ and $I$ are equivalent for all facts over relations referenced in $\mathcal{C}_1$. We prove by induction over the immediate consequence of rules $\varphi$ whose head $r$ is referenced in $\mathcal{C}_1$. If $\varphi$ is an original rule of $\mathcal{C}_1$, then by the inductive hypothesis, the claim is true. If $\varphi$ is a rule of another component $\mathcal{C}'$, then the rule is unchanged and the claim is still true. Now consider $T_{\varphi'}(I')$ where the head of $\varphi'$ is rSealed corresponding to some $r$ after transformation. $\varphi'$ is a rule we introduced, so there is no immediate consequence over it in $I$. Let $\varphi$ refer to the rule with $r$ at its head instead of rSealed; $\varphi$ is a rule of $\mathcal{C}_2$. We can show that the immediate consequence $T_{\varphi'}(I')$ at time $t'$ is possible if and only if $T_\varphi(I)$ at some time $t > t'$ is possible; this is true if for all relations $r_b$ in the body of $\varphi$, $I'_{r_b,t'} = I_{r_b,t}$. Then as long as $t > t'$, the facts of $r$ of immediate consequence $T_\varphi(I')$ can be asynchronously delivered from $\mathcal{C}_2$ to $\mathcal{C}_1$ at some time $t_a$, where $t \ge t_a > t'$, and be sealed at time $t$ in an immediate consequence $T_{\varphi'}(I')$ over $\varphi'$, resulting in the same facts in rSealed in $I'$ and $r$ in $I$.
In the variable-attribute substitution step, instead of only retaining FDs where the variables in the domain and co-domain can all be replaced with attributes in $r$, retain FDs where any variable can be replaced with attributes in $r$. These are the CDs of $r$ and the relations $r$ joins with in $\varphi$; they describe how attributes of $r$ join with other relations in $\varphi$. The CDs that hold across rules can be identified with the intersection step for FDs. B.2.2 Mechanism. The rewrite mechanics are identical to those in Appendix B.1.1. B.2.3 Proof. The proofs are similar to those in Appendix B.1.2, assuming a CD $g$ exists over attributes $A$, $B$ of relations $r_1$, $r_2$ only if for any $f_1$, $f_2$ of $r_1$, $r_2$ in the same proof tree, we have $\pi_A(f_1) = g(\pi_B(f_2))$. B.3 Partial partitioning. B.3.1 Mechanism. In order to replicate relations $r$ referenced in $\mathcal{C}_1$, we create a new "coordinator" proxy node at addr', introduce a relation proxy($l$,$l'$), and populate proxy with the tuple (addr, addr'). Rewrite: Replication. Given a rule $\varphi$ in another component $\mathcal{C}'$ whose head relation $r$ is referenced in $\mathcal{C}$:
For each remaining relation $r$ in $\mathcal{C}$, replace $r$ with rSealed and add the accompanying sealing rules. Before stating our proof goal, we present terms to describe partitioned state. For simplicity, we denote $I_{R,n,t}$ as the set of facts $f$ in $I$ where $f$ is a fact of a relation in $R$, $node(f) = n$, and $time(f) = t$. A relation $r$ is empty at time $t$ in instance $I$ and node $n$ if there is no fact $f$ of $r$ in $I$ where $time(f) = t$ and $node(f) = n$. The corresponding facts of a replicated fact $f$ are all facts $f'$ where $f$ equals $f'$ except $loc(f) \ne loc(f')$.