On The Suitability of Differential Dataflow For Datalog Interpretation In Highly Dynamic Settings

In the domain of knowledge representation and reasoning within AI, datalog engines play an increasingly crucial role. The crux of their operation lies in materialization: the evaluation of a datalog program and the incorporation of its consequences into a database. This operation becomes complex and resource-intensive, especially when the data is highly dynamic, as is common in distributed environments. Thus, incremental materialization, adjusting the computation to new data instead of restarting it, is the norm. However, handling the deletion of data is significantly more complicated than addition, due to the cascading effects of what is being removed. Differential Dataflow offers a computational model that effectively addresses this, ensuring consistent performance for both data additions and deletions. In this paper, we delve into the efficiency of materialization using three distinct datalog implementations: one based on a streamlined relational engine, and two others that implement the same substitution-based algorithm, one with differential dataflow and one without. Our insights provide a roadmap for enhancing datalog-driven computations, particularly in dynamic data environments like the cloud.


INTRODUCTION
Datalog [8] is increasingly relevant in the modern context. While it has traditionally been used almost exclusively for the workhorse of symbolic AI, database reasoning, more recent implementations, such as the Rego [20] language for cloud policy evaluation, showcase the continued relevance and flexibility of datalog.
The Open Policy Agent [21] (OPA) is a versatile policy engine designed to unify policy enforcement across the cloud-native stack. Central to OPA is its policy language, Rego, which is a variant of datalog. Like Datalog, it is a declarative logic programming language: instead of writing imperative code, developers specify constraints and policies about cloud components, and the engine determines whether these policies have been violated, given a query.
The key use of Rego in OPA is to assert policies over structured data, most commonly JSON. For instance, in the Kubernetes [18] container orchestrator, one might define a policy that prevents applications from being exposed externally unless certain annotations are present. This policy would be represented as a set of Rego rules. When a request reaches the Kubernetes API server, it is evaluated against these rules, and the server allows or denies it based on the result.
Given the high velocity and volume of cloud-native data, it is paramount that policy decisions are made efficiently. This often implies adjusting an already-existing evaluation, referred to as a materialization, to new facts, instead of scrapping it and starting anew. While the addition of new facts is known to be handled efficiently, deletions stand as a more complex challenge.
The naive approach to handling deletions, retracting a fact alongside everything derived from it, might take longer than restarting the computation. This inefficiency led to the emergence of the delete-rederive (DRED) method [10], which computes the adjustment through the evaluation of new datalog programs: first calculating all possible deletions, and then determining alternative derivations. The difference between these sets represents the actual facts to be deleted.
Handling fact additions and retractions in different ways leads to drastically biased performance characteristics [14]. Because of this, modern datalog engines tend to avoid supporting the latter form of updates entirely. Differential Dataflow [2] is a general dataflow programming model that lifts an iterative non-incremental algorithm into its incremental, referred to as differential, version. An algorithm outlined over some input data with differential dataflow will then work over update differences of the input data, which can be either positive, such as additions of new data, or negative, such as removals. This can be leveraged in datalog evaluation as a way to handle updates in a uniform manner.
We note that incremental datalog evaluation is akin in process to online learning [6], and the results of the experiments presented here are also indicative of the suitability of differential dataflow for problems in that sphere.
Contributions. We present an algorithm for interpreting datalog programs as a dataflow of facts and rules, with precise semantics. The semantic medium for that algorithm is the DBSP language, which is able to formalise a limited subset of differential dataflow (DD). We then implement it in Rust with the differential dataflow library, and compare it with the most common methods of datalog evaluation: semi-naive evaluation and the delete-rederive method.
Our comparison provides a thorough overview of performance and memory usage over multiple programs, datasets and evaluation methods. We compared our algorithm with one state-of-the-art engine and two off-the-shelf datalog engines, of which one uses the method of converting a datalog program to relational algebra, and the other a less common term rewriting method.
The state-of-the-art engine that we chose to compare with is Soufflé [22]. Like most datalog engines, it does not support incremental evaluation, making it a good example of how subpar such engines' performance can be in highly dynamic scenarios.
Structure of the paper. The paper is organized as follows:
• Related Works provides context to the paper by referencing recent high-profile works in the subfield of incremental datalog evaluation, alongside other attempts at datalog evaluation that have used differential dataflow in some way.
• Background contains an overview of Datalog and its evaluation methods, including a detailed explanation of semi-naive evaluation and how it can be incrementally maintained.
• Proposal brings out our approach for modelling datalog interpretation with DBSP, how it relates to Differential Dataflow, and why that could be beneficial for evaluating datalog programs in low-latency environments.
• Experiments details the empirical evaluation setup, including the datasets used, benchmarks, and a comparative analysis with existing systems. It evaluates the performance of our approach in terms of runtime efficiency and memory usage.
• Conclusion summarizes our findings, discusses the implications of our work, and suggests future directions for research in this area.

RELATED WORKS
The state of the art in incremental evaluation of datalog programs is presented in [15]. It provides a thorough overview of the two most relevant classical methods, delete-rederive and counting, alongside substantial empirical evaluations that provide key observations on their practical performance.
As the main novelty of this paper is in implementing a datalog interpreter with differential dataflow, we acknowledge the existence of two other datalog engines based on it. The most high-profile one is DDLog [19]. It compiles a datalog program into a Rust DD program that executes recursive relational algebra, therefore being similar in strategy to Soufflé. Our system is orthogonal to it: being an interpreter, it does not suffer from long compilation times, and programs can be freely changed at runtime.
Laddder [24] is an incremental evaluator of lattice-based datalog programs that extends differential dataflow. Our reasoner diverges in fundamental principles: it does not have lattice semantics, it is an interpreter, and it focuses on evaluating general datalog programs.

BACKGROUND
Datalog [8] is a programming language whose semantics denote the evaluation of a set of possibly-recursive restricted Horn clauses, a program, over a fact store, while remaining not Turing-complete. Evaluating a program $\Pi$ entails computing all implicit consequences over a fact store $I$, yielding new facts. A program is a set of rules of the form $h \leftarrow b_1, \ldots, b_n$, with $h$ as the head atom, containing terms that can be either constants or variables, and $b_1, \ldots, b_n$ as the body, with each $b_i$ being an atom; for instance, $reach(x, z) \leftarrow edge(x, y),\ reach(y, z)$.
Materialization, the incorporation of all of a program's consequences, $M = \Pi(I) \cup I$, eliminates the need for reasoning during query answering. Maintaining this computation in the face of additions $I^+$ and deletions $I^-$, $M' = \Pi(I \cup I^+ \setminus I^-) \cup (I \cup I^+ \setminus I^-)$, is the crux of modern Datalog, as it relates to the broader problem of incremental view maintenance [13] (IVM).

Monotone evaluation
Computing the LFP of $\Pi$ can be done intuitively, by applying the immediate consequence operator $T$ until there are no more new facts: $\Pi(I) = \bigcup_{i=1}^{\infty} T(I_i)$, with $I_i$ representing the intermediate materialization at step $i$. This method is called naive evaluation. Semi-naive evaluation differs in that the program is transformed into a delta program $\Delta\Pi$, with new rules such that only the most recently inferred facts can progress the computation. Given a $\Pi$ with rules $r_0, \ldots, r_n$, each having $m$ body atoms, up to $m$ new rules are created for each, with every $j$-th new rule containing the same body atoms sans $b_j$, which is substituted for $\Delta b_j$, a set that only ever contains what was inferred in the previous recursion step.
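As a concrete illustration (our example, not one from the paper), consider a reachability rule and the delta rules that the transformation above derives from it, replacing each body atom in turn by its delta version:

$$
\begin{aligned}
reach(x,z) &\leftarrow edge(x,y),\ reach(y,z)\\
reach(x,z) &\leftarrow \Delta edge(x,y),\ reach(y,z)\\
reach(x,z) &\leftarrow edge(x,y),\ \Delta reach(y,z)
\end{aligned}
$$

At every iteration, $\Delta reach$ holds only the facts inferred in the previous step, so joins that were already explored are never repeated.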

Fact retractions
Due to semi-naive evaluation's naturally incremental nature, it suffices for efficiently handling the case where $I^- = \emptyset$. The delete-rederive (DRED) method [10] is the standard for handling the rest.

Example 3.2. The problem of overdeletion
This example demonstrates an instance of the problem of overdeletion, targeted by the first step of the two-step delete-rederive method. Here, program $\Pi$ comprises two rules that together infer facts in a relation $C$ from the presence of facts in either $A$ or $B$. Initially $I$ contains $n$ facts from $A$ and $B$; at some point, a deletion update $I^-$ arrives, containing all $A$-facts up to $a_{n/2}$. DRED's expensive first step would unnecessarily delete all facts in $C$ up to $c_{n/2}$, in spite of them still clearly holding due to $B$. This is partially addressed by the second step of the algorithm, which computes alternative derivations for all overdeletion candidates.
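To make DRED's two phases concrete, the following is a minimal sketch (our rendition of the method from [10], on a hypothetical edge/reach reachability program, not the paper's example program):

$$
\begin{aligned}
\text{overdelete:}\quad reach^{-}(x,z) &\leftarrow edge^{-}(x,y),\ reach(y,z)\\
reach^{-}(x,z) &\leftarrow edge(x,y),\ reach^{-}(y,z)\\
\text{rederive:}\quad reach^{r}(x,z) &\leftarrow reach^{-}(x,z),\ edge(x,y),\ reach(y,z)
\end{aligned}
$$

The overdeletion rules are evaluated to a fixed point, the rederivation rule then runs over the facts that survive overdeletion, and the facts actually removed are $reach^{-} \setminus reach^{r}$.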

PROPOSAL
A promising way to compute updates to $\Pi(I)$ in a unified manner is to formalise it as a DBSP circuit and incrementalize it. The main premise of DBSP is that any algorithm described with it can be incrementalised with a deterministic algorithm. By doing so it is possible to take advantage of the efficient Rust implementation of DD [1]; implementing naive datalog evaluation would then yield semi-naive evaluation that supports retractions too.

Substitution-based immediate consequence
The most impactful aspect of evaluation is the implementation of the immediate consequence operator $T(I)$. There are two main methods to do so. The first is to evaluate rules as a term rewriting problem, named the substitution-based method; the second is to rewrite datalog rules as relational algebra equations and delegate their execution to an efficient relational engine. The latter is the most popular, as it offloads significant amounts of complexity. We focus however on the former, since it has had less recent research interest, and is significantly easier to implement and parallelize.
The basis of the substitution-based method are substitutions, mappings $\sigma = \{v_1 \mapsto c_1, \ldots, v_n \mapsto c_n\}$ with all $v_i$ being variable terms and all $c_i$ constants. There are three fundamental operations, with $a$ and $f$ as atoms and $\sigma_i$ as substitutions:
• Application. apply($\sigma$, $a$): The application of $\sigma$ to $a$ replaces every variable term $v$ in $a$ with its corresponding mapping in $\sigma$, if one exists. The result of application is an atom with possibly fewer variables.
• Extension. extend($\sigma_1$, $\sigma_2$): Extending $\sigma_1$ with $\sigma_2$ combines the mappings of both substitutions: if they are consistent (they do not provide different mappings for the same variable), the result is the simple union $\sigma_1 \cup \sigma_2$; if there is any variable for which the two substitutions provide different mappings, the extend operation fails.
• Unification. unify($a$, $f$): Unification attempts to find a $\sigma$ such that, when applied to an atom $a$, the result is identical to an atom $f$ that contains only constants.
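As an illustration, the following is a minimal Rust sketch of the three operations. The concrete representations are our assumptions, not the paper's exact types: symbols, variables and constants are interned as u32, and a substitution is an ordered map from variables to constants.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
enum Term {
    Var(u32),
    Const(u32),
}

#[derive(Clone, Debug, PartialEq)]
struct Atom {
    symbol: u32,
    terms: Vec<Term>,
}

type Substitution = BTreeMap<u32, u32>;

/// apply(σ, a): replace every variable of `atom` that `sub` maps,
/// leaving unmapped variables untouched.
fn apply(sub: &Substitution, atom: &Atom) -> Atom {
    let terms = atom
        .terms
        .iter()
        .map(|t| match t {
            Term::Var(v) => sub.get(v).map_or_else(|| t.clone(), |c| Term::Const(*c)),
            constant => constant.clone(),
        })
        .collect();
    Atom { symbol: atom.symbol, terms }
}

/// extend(σ1, σ2): the union of two consistent substitutions;
/// fails if they disagree on any variable.
fn extend(s1: &Substitution, s2: &Substitution) -> Option<Substitution> {
    let mut out = s1.clone();
    for (v, c) in s2 {
        if let Some(prev) = out.insert(*v, *c) {
            if prev != *c {
                return None; // conflicting mappings for the same variable
            }
        }
    }
    Some(out)
}

/// unify(a, f): find σ such that apply(σ, a) == f, where f is ground.
fn unify(atom: &Atom, fact: &Atom) -> Option<Substitution> {
    if atom.symbol != fact.symbol || atom.terms.len() != fact.terms.len() {
        return None;
    }
    let mut sub = Substitution::new();
    for (t, f) in atom.terms.iter().zip(&fact.terms) {
        match (t, f) {
            (Term::Const(a), Term::Const(b)) if a == b => {}
            (Term::Var(v), Term::Const(c)) => {
                if let Some(prev) = sub.insert(*v, *c) {
                    if prev != *c {
                        return None; // same variable bound to two constants
                    }
                }
            }
            _ => return None, // mismatched constants, or a non-ground fact
        }
    }
    Some(sub)
}
```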
The substitution-based method relies on computing the union of the immediate consequences of each rule. This is done as follows: define the initial set of substitutions as $\Sigma_0 = \{\sigma_0\}$, where $\sigma_0$ is the empty substitution. Then, for each body atom $b_i$, find the set of ground facts $F_i \subseteq I$ that match $b_i$.
Algorithm 1 is the standard substitution-based method. We exemplify its execution in Table 1.
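For reference, the following is our compact reconstruction of Algorithm 1's loop in terms of the operations sketched above; the paper's pseudocode may differ in details.

```rust
// One rule of a datalog program: head <- body_1, ..., body_m.
struct Rule {
    head: Atom,
    body: Vec<Atom>,
}

fn immediate_consequence(rule: &Rule, facts: &[Atom]) -> Vec<Atom> {
    // Σ0 = {σ0}: start from the single empty substitution.
    let mut sigma: Vec<Substitution> = vec![Substitution::new()];
    for body_atom in &rule.body {
        let mut next = Vec::new();
        for sub in &sigma {
            // Rewrite the body atom with what has been bound so far...
            let rewritten = apply(sub, body_atom);
            // ...then attempt to unify it against every fact in the store.
            for fact in facts {
                if let Some(unifier) = unify(&rewritten, fact) {
                    if let Some(extended) = extend(sub, &unifier) {
                        next.push(extended);
                    }
                }
            }
        }
        // Substitutions that matched no fact are dropped here.
        sigma = next;
    }
    // Apply every surviving substitution to the head to emit new facts.
    sigma.iter().map(|s| apply(s, &rule.head)).collect()
}
```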
The non-trivial program is $\Pi = \{S(?x, ?z) \leftarrow R(?x, ?y),\ T(?x, ?y),\ T(?y, ?z)\}$ with fact store $I = \{T(a, b), T(b, c), T(c, d), R(a, b), R(b, c)\}$. The final substitutions (those that will be applied to the head of their respective rules) are $\{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$ and $\{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$. As can be seen, there is a large number of wasteful attempts. This can however be partially remedied by extensive parallelism, since every attempt in the same iteration can happen independently of the others. Aside from incrementalisation, DD's implementation offers extensive facilities for distributing computation, leveraging its underlying model, Timely Dataflow [17].

The issue
We can then summarise Algorithm 1's problems:
(1) Complexity: The nested loops combined with recursive calls can lead to high computational overheads. For each body atom, the algorithm attempts to unify against every fact in the database and every existing partial substitution. As the number of facts or the length of the rule body grows, this can become extremely costly.
(2) Memory Usage: $\Sigma_i$ can grow rapidly, especially if there are many possible substitutions that satisfy the body atoms.
(3) Redundancy: Many of the computed substitutions might be redundant or irrelevant for deriving the final consequence, leading to unnecessary computational work.
(4) Lack of Indexing: The method as presented does not leverage any indexing on the facts. In practice, efficient indexing techniques can vastly speed up the lookup operations, making rule evaluations much faster.
(5) Lack of Optimization: Traditional relational database systems employ various optimization techniques, like query rewriting, join ordering, and cost-based optimization. The pure substitution-based method does not inherently take advantage of these techniques, since it is not built on top of a relational engine.
(6) Scalability Issues: Techniques like data partitioning and distributed computation, which are often used in modern Datalog systems, are not straightforwardly integrated into this model.
Relational engines do not suffer from almost any of these limitations. By utilizing DD, however, we can retain the simplicity of substitution-based evaluation while overcoming a significant portion of the downsides, obtaining the following advantages:
• Incremental Computation: DD processes changes to data rather than recomputing results from scratch. This is particularly beneficial in the context of Datalog evaluation, where small updates to the facts or rules might otherwise necessitate large recalculations.
• Optimized Memory Usage: Instead of storing entire sets of substitutions, DD maintains differences between sets, which can lead to significant memory savings.
• Efficient Indexing: Arrangements are partitioned indices, implemented as in-memory LSM trees, designed for efficient data organization, enabling rapid lookups, joins, and aggregations. These indices facilitate the persistent reuse of indexed views.

A DBSP circuit for substitution-based evaluation
DBSP is a language that provides a unified, formal and sound foundation for manipulating non-streaming, streaming and incremental computations. It has been shown [5] to model many rich query languages, such as relational algebra extended with aggregates and both monotonic and non-monotonic recursion. The main assumption is that inputs must arrive in time order, while still allowing for nested time domains. This restricts it to a subset of DD.
Within the context of this work, streams $S_A$ are maps $\mathbb{N} \to A$, with the domain representing time and the codomain a value from an abelian group $A$. Stream operators are functions $S_A \to S_B$, extending to multiple inputs and outputs. The most elementary operator is stream lifting $\uparrow$: given a function $f : A \to B$, $\uparrow f : S_A \to S_B$ applies $f$ pointwise.
The delay operator $z^{-1} : S_A \to S_A$ produces a stream that is one step behind its input, $z^{-1}(s)[t] = s[t-1]$ for $t \geq 1$ and $0$ otherwise, and is fundamental to implementing both naive and semi-naive evaluation. A stream operator $F$ is time-invariant if $F \circ z^{-1} \equiv z^{-1} \circ F$. The most notable property is causality: an operator is causal if its output at time $t$ only ever relies on inputs with time $t'$ such that $t' \leq t$, and strict if $t' < t$.
Lifted operators are not necessarily strict, but are causal. The main requirement for implementing Algorithm 1 is computing the least fixed point of the immediate consequence operator.
All strict operators are guaranteed to have a unique fixpoint solution. Assuming that the immediate consequence can be outlined as a causal operator, and given that for every strict $F : S_A \to S_A$ and causal $T : S_A \times S_A \to S_A$ the operator $\operatorname{fix}\ \alpha.\ T(s, F(\alpha))$ is well-defined and causal, it follows that if Algorithm 1 is $T$ and the delay operator is $F$, then the least fixed point of the immediate consequence is well-defined.
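Concretely, under these definitions naive evaluation can be written as a feedback loop (our transcription of the standard DBSP recursion scheme; the notation is assumed, not quoted from the paper):

$$
\Pi(I) = \operatorname{fix}\ \alpha.\ T\big(I \cup z^{-1}(\alpha)\big)
$$

where the strict delay $z^{-1}$ plays the role of $F$, guaranteeing that the fixed point exists and is unique.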
Linear operators are those that are group homomorphisms, with bilinear ones extending that concept to operators with multiple inputs. All linear operators are causal. We compose Algorithm 1's circuit entirely with causal operators and functions, therefore the circuit is causal itself. Furthermore, as referenced in [5], we model facts, rules and substitutions as elements of the $\mathbb{Z}$-set abelian group, which allows for the usage of relational operators with correct set semantics by utilizing the distinct operator. distinct's $\mathbb{Z}$-set semantics coalesces values with negative multiplicities into set removals, and ensures that every value has a multiplicity of at most one: for instance, $\operatorname{distinct}(\{a \mapsto 2,\ b \mapsto -1\}) = \{a \mapsto 1\}$.

[Figure 1: The DBSP circuit for substitution-based datalog evaluation.]
Figure 1 showcases the DBSP circuit of Algorithm 1. It implements naive evaluation, and is therefore not incremental. DBSP however stands out for its unique capability to deterministically incrementalize any circuit. When this circuit is outlined with the DD library, it undergoes automatic incrementalization and parallelization, enhancing its efficiency by handling differences of facts, rules, and substitutions.

The circuit's operators are lifted versions of their unlifted counterparts, translating the imperative style of Algorithm 1 into a functional one, more suitable to dataflow computation. We make extensive use of standard scalar functions from the functional programming literature, readily available in the DD library. Operator m0 processes a rule identifier and a rule, enumerating the atoms in the rule and associating each with its position and the rule identifier, each combination having a multiplicity of 1, signalling that iteration for new substitutions starts at the first body atom and proceeds atom-by-atom. m1 applies it to all incoming rules by flatmap-ing them with m0. m2 pairs each rule's identifier with a tuple containing the head of the rule and the length of its body, for every rule in a given $\mathbb{Z}$-set of rules. m3 takes pairs of heads and substitutions and maps over them to produce a new fact, as the result of applying a substitution to a rule head.
The u operator is responsible for unifying rewrites with facts, while e extends a substitution with a new one. o1 and o2 are more complex, involving the unification of atoms with facts and the extension of original substitutions, with o2 also incrementing the position and associating it with the rule identifier, thereby either invalidating substitutions or forwarding them to the next iteration. m4 maps over pairs of old facts and head/substitution tuples, applying o2 to produce a sequence of extended substitutions. m5 works with pairs of heads and new substitutions, applying each substitution to the respective head, producing ground facts.
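To give a flavour of how such a circuit is expressed with the DD library, the following is a minimal, self-contained sketch (our example, not the paper's full circuit): the single reachability rule reach(x, z) ← reach(x, y), edge(y, z), showing how iterate, join and distinct realize a fixed point whose input accepts both additions and retractions. It assumes the timely and differential-dataflow crates.

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::{Iterate, Join, Threshold};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut edges = worker.dataflow::<u32, _, _>(|scope| {
            let (handle, edges) = scope.new_collection::<(u32, u32), isize>();
            edges
                .iterate(|reach| {
                    // Bring the static edge collection into the loop's scope.
                    let edges = edges.enter(&reach.scope());
                    reach
                        .map(|(x, y)| (y, x)) // key reach(x, y) by y
                        .join(&edges)         // reach(x, y) ⋈ edge(y, z)
                        .map(|(_y, (x, z))| (x, z))
                        .concat(&edges)       // base case: every edge is reachable
                        .distinct()           // Z-set distinct: set semantics
                })
                .inspect(|x| println!("{:?}", x)); // (fact, time, ±difference)
            handle
        });
        edges.insert((1, 2));
        edges.insert((2, 3));
        edges.advance_to(1);
        edges.flush();
        edges.remove((1, 2)); // a retraction is just a difference of -1
    })
    .unwrap();
}
```

Once the dataflow is built, additions and retractions flow through the same circuit as positive and negative differences, which is precisely the uniformity of update handling the paper is after.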
With the circuit and its constituents outlined and described, we benchmark its implementation against multiple datalog engines.

EXPERIMENTS
Two experiments were conducted in order to compare incremental reasoning performance. The first measured materialization adjustment latency under both additions and deletions. The second recorded the maximum resident set size under full materialization. We refer to the reasoner implemented with DD as Diff. We also compare with other reasoners, two of which we implemented ourselves, due to a lack of open-source incremental reasoners. We name the first Chibi and the second Relational. The first uses the same substitution-based method as Diff, but without DD. The second, as its name hints, is built on top of a relational engine implemented from scratch. Both are written in Rust, and share as much code as reasonable between themselves and Diff, for the sake of fairness.
We also compare against a state-of-the-art reasoner, Soufflé. It can be run in both interpreted and compiled modes. We run it as an interpreter, as all of our reasoners are interpreters too. Soufflé does not support incremental updates. Benchmarking how it would fare in a dynamic setting is nonetheless very valuable: most reasoners are not incremental, so Soufflé is exemplary of how inefficient they are in dynamic settings, which are more realistic than static scenarios.
Setup. Experiments were run on an amazon-web-services-provisioned M1 Mac machine, with 8 CPU cores and 16 gigabytes of RAM. DDLog was version 1.2.3, Soufflé 2.4, and Rust 1.72. Each benchmark measurement was taken a sufficient number of times to ensure that variance was low. No other user-space processes aside from the benchmark were running.
Datasets. Table 2 shows all datasets and program names. There are two areas of interest. The semantic web has pushed the datalog envelope by being an extensive source of improvements to already-established algorithms [14], and by providing a myriad of both synthetic and real datasets. The second area of interest is graph benchmarks that follow some distribution, allowing for the measurement of one of the most important aspects of recursive datalog evaluation: the cost of iteration.
• LUBM is a well-known benchmark dataset for both the RhoDF [16] and OWL2RL rulesets. The data is divided in two parts: the ontology that describes universities, and the assertional data, which contains facts about universities described with the ontology. RhoDF is a single-relation program that realizes graph entailment of a subset of the RDFS [3] metamodel. All rules are mutually recursive and have the same head, hence being inefficient to evaluate and parallelise. OWL2RL [7] is a description logic, a subset of the OWL language, that is meant to be implemented with rule-based languages such as datalog. This program was generated by converting OWL2RL entailments to datalog rules [9]. It has over 100 rules, making it a canary for reasoner efficiency.
• RMAT10k is a standard benchmark used by various other reasoners [11, 12, 23, 25]. It is a dense graph that follows the RMAT profile of the GT [4] synthetic graph benchmark. It contains 10x as many edges as vertices, following a power-law distribution.
• RAND1k is a graph from GT as well, with 1000 nodes that each have a 1% chance of being connected to every other node.

Runtime comparison
Tables 3 and 4 show the results of the first benchmark. Three measurements were taken, each recording the time to materialize, respectively: $I$, the initial materialization batch; $I^+$, the adjustment to an update with additions; and $I^-$, adjusting to deleting the same data that was added with $I^+$. All measurements are in seconds. As an example, if the initial materialization batch size is 75%, then $I$ represents the amount of time taken to materialize 75% of the data, $I^+$ is how long the incremental materialization of the remaining 25% of the data took, and lastly, $I^-$ is how long it took to adjust the materialization to the removal of that same update, that is, back to the same state as with $I$.
The choice of facts can significantly impact reasoning performance. However, conducting comprehensive performance assessments over all subsets is not feasible, due to the extensive time required to execute the complete benchmark and the overwhelming, factorial quantity of potential data permutations. To address this, we opted for a streamlined approach: we randomly selected subsets, specifically choosing sizes that constitute 50%, 25%, 10%, and 1% of the original data. We note that this was done differently for LUBM, since it has two components, the TBox and the ABox. We ensured that the TBox was always fed (fully materialized) first.
Since Soufflé supports neither incremental additions nor deletions, it must rematerialize with each update. In order to compute the time taken to adjust to an update $I^+$, it is necessary to compute the full materialisation of $I \cup I^+$. Conversely, doing so in place of $I^-$ would be the same as measuring $I \cup I^+ \setminus I^-$, which is just $I$.
The first benchmark group consists of the two entailment programs that run on the LUBM dataset. In spite of RDFS having six rules and OWL2RL almost 100, the size of the output $M$ is similar. For RhoDF, Diff exhibits the expected uniformity in the handling of updates. The times taken for $I^+$ and $I^-$ are almost the same, with scaling appearing linear in the size of the update. Soufflé materializes fast, and manages to be faster than all reasoners, including those that use DRED. Its rematerialization, however, is up to 25x slower than incremental addition for OWL2RL. We posit that Diff does well here because OWL2RL is friendlier to parallelization than RhoDF: parallelism is guided by the number of relations, of which RhoDF effectively has one, and OWL2RL has many.
We contrast this with Relational and Chibi, which each parallelize at the level of rules, with each rule being applied in parallel. These two reasoners furthermore perform in opposite directions on OWL2RL, with Chibi running out of time (OOT) to compute additions and deletions, and Relational demonstrating uniform behavior only with very small update sizes.
The second benchmark group focuses on measuring performance over a simple reachability program on highly connected graphs, offering an overview of the effectiveness of every reasoner's handling of long iterative chains. RAND-1k yields a small output, with only approximately 12000 facts being emitted. It does not pose a challenge to any reasoner. RMAT-10k on the other hand, in spite of being ten times bigger, produces over 25x more facts, with certain simple paths being thousands of edges long. This causes a large amount of timestamp flux in differential dataflow, due to too-fine-grained difference tracking. Soufflé shines in this situation, with rematerialization reigning over all update sizes aside from 1% or less.

Peak memory usage comparison
In this subsection we compare the maximum resident set size during full materialization. Table 5 presents the maximum resident set size for each of the methods and programs across different datasets. Memory usage is presented in MB. LUBM1 takes up 20 MB of disk space; RAND-1k and RMAT-10k, respectively, 0.10 MB and 1 MB. Soufflé consistently shows the lowest memory usage, highlighting the cost of incrementalization. It is at least two times more memory efficient than all others, and up to 25 times in RMAT. Diff's poor handling of highly iterative scenarios is furthered by RMAT's memory consumption: Diff consumes 5000x more memory than the input size. We profiled that the difference set that stores substitutions starts to grow quadratically as the graph is traversed. We posit that this is expected behavior, necessary to ensure that update times remain linear, and that techniques to reduce memory usage in DD are warranted.
The most surprising result is OWL2RL, in which both Chibi's and Relational's peak memory usage is 10x higher than Diff's. This is explained by the fact that Diff shares as much data as reasonable between rule executions. The other reasoners, on the other hand, when executing the almost 100 rules independently of each other, may have all of their rule execution indexes and data in memory at the same time, hence having a much larger memory footprint.

CONCLUSION
In this article we explored implementing an efficient datalog engine that uniformly handles additions and deletions with DD. We strayed from the state of the art by not translating a datalog program to relational algebra, but directly interpreting it. In order to give precise semantics to our proposal, we outlined it as a non-incremental DBSP circuit, with the premise that implementing it with DD's Rust library yields its incremental version.
Compared to the other reasoners, which handled updates through delete-rederive, or with rematerialization in the case of Soufflé, results showed that Diff was the only reasoner to exhibit uniform performance across updates, while using less memory than traditional methods. Relative to Soufflé however, it used in the best case 2x more memory, and up to 500x more in the worst case. This highlights the cost of DD, which proved to be prohibitive in highly iterative scenarios.
We conclude that our contributions provide further evidence that DD is a promising platform for datalog. This work could be expanded in multiple directions, such as supporting more expressive variants of datalog, and making it distributed, which is already supported by DD's underlying distribution logic, Timely Dataflow [17].
[Algorithm 1: Substitution-based Immediate Consequence.]
[Table 1: Example execution of Algorithm 1.]

Table 3: Runtime Experimental Results - I

Table 4: Runtime Experimental Results - II

Table 5: Memory usage experimental results