Verification of Unary Communicating Datalog Programs

We study verification of reachability properties over Communicating Datalog Programs (CDPs), which are networks of relational nodes connected through unordered channels and running Datalog-like computations. Each node manipulates a local state database (DB), depending on incoming messages and additional input DBs from external services. Decidability of verification for CDPs has so far been established only under boundedness assumptions on the state and channel sizes, showing at the same time undecidability of reachability for unbounded states with only two unary relations or unbounded channels with a single binary relation. The goal of this paper is to study the open case of CDPs with bounded states and unbounded channels, under the assumption that channels carry unary relations only. We discuss the significance of the resulting model and prove the decidability of verification of variants of reachability, captured in fragments of first-order CTL. We do so through a novel reduction to coverability problems in a class of high-level Petri Nets that manipulate unordered data identifiers. We study the tightness of our results, showing that minor generalizations of the considered reachability properties yield undecidability of verification, both for CDPs and the corresponding Petri Net model.


Introduction
Declarative approaches to the specification of distributed data-aware systems have been extensively studied in many different contexts [1,2,3,4,5]. These approaches share the general idea that the overall behavior of the system emerges from the interaction of a number of local components (hereafter called nodes), mutually connected in a given topology, each running a declarative program that describes at once the input/output behavior to exchange messages with the other nodes, and the update of the node internal state. Both the state and the exchanged messages are relational, thus making the overall system a distributed version of so-called data-aware processes, extensively studied within the foundations of data management from the modelling and static analysis point of view [6,7,8,9,10]. In this work, we are interested in the static analysis of such distributed declarative data-aware processes, in the style of [6,7,8]. We focus in particular on the D2C language originally introduced in [4], which employs a suitably extended version of Datalog equipped with communication primitives and the possibility of referring to the previous and current node state. On top of the resulting model of what we call Communicating Datalog Programs (CDPs), two aspects become particularly important in the light of static analysis: (i) the presence of communication channels with different properties on faithfulness and ordering; (ii) the distinction between closed systems, where new data are never created and only the data present in the initial node states can be used and exchanged, and interactive systems, where new data can be acquired and exchanged during the computation.
SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy. *Corresponding author. aiswarya@cmi.ac.in (C. Aiswarya); diego.calvanese@unibz.ir (D. Calvanese); francesco.dicosmo@unibz.it (F. Di Cosmo); marco.montali@unibz.it (M.
Declarative distributed systems with asynchronous communication occurring over multiset channels (where multiple copies of the same message may exist, even when the sender and receiver coincide) were considered in seminal works in the area, but only studied in connection with static analysis in the presence of external data in [11]. Such systems are infinite-state, with the consequence that even for very simple reachability properties, static analysis is undecidable [11]. Decidable subclasses have been singled out by importing and adapting the notion of state-boundedness originally introduced in [12,13,14], and applied in [13,15,16] to obtain decidability of verification of data-aware processes against rich variants of first-order branching-time temporal logics. In a state-bounded system, infinitely many objects may be seen within and across runs of the system, but in each single configuration reached during the computation, their number remains bounded. In the context of CDPs, this notion has a twofold effect: it essentially bounds the number of constants that can be simultaneously stored in each node state, as well as the size of each communication channel. Under such restrictions, it has been shown that model checking first-order CTL properties is decidable [11].
In this work, we start from the observation that bounding communication channels is a severe restriction, as it cannot be enforced even by suitably controlling how nodes are programmed. At the same time, [11] has shown that even propositional reachability is undecidable to check over severely restricted CDPs that employ messages with a binary signature. We consequently focus on the verification of unary CDPs, i.e., CDPs where the messages range over a signature of at most unary relational symbols, and while the local memory and interaction with external services is bounded, the channel capacity is not. This is also interesting to study in the light of multiset channels, since adopting queues, as in [9,10], would immediately yield undecidability for unary, unbounded channels. We show that the resulting model is still powerful enough to model real-case scenarios, and engage in a fine-grained study of CDP verification against variants of reachability, expressed as fragments of first-order CTL. Specifically, we establish an equivalence between this problem and that of verifying coverability over a variant of Petri nets with unordered data [17], a property that is decidable to check despite the fact that these nets are essentially infinite-state. This yields decidability for positive nested reachability queries over unary CDPs, even in the case where the logic has not only the ability of querying the states, but also that of inspecting communication channels. We finally investigate the tightness of our decidability results, showing that minor generalizations fall back into undecidability.

The CDP Model
In this section, we informally introduce the CDP model by Ma et al. [4]. However, for simplicity, we formalize only the fragment relevant for our study, which is the one over unordered channels, non-deterministic bounded inputs, and single-node networks.
A CDP is a fixed network of data-centric nodes sharing messages via point-to-point channels. Each node (1) runs a Datalog-like program, written in the language D2C, (2) updates its internal state, which is maintained as a state DB over a dedicated state signature, (3) receives information from the external environment, in the form of an input DB over a dedicated input signature, and (4) shares messages, i.e., single relational facts over a dedicated transport signature. The nodes react to incoming messages: when a message from a sender node is delivered to a recipient node, the latter gets activated and runs the program on its data sources. In fact, the program input consists of the node state DB, the current input DB, the delivered message itself, and the local structure of the network at the recipient, in the form of a network DB. The output provides a new state DB for the recipient and a set of outgoing messages, each labeled by its recipient, which, in turn, are sent on the respective channels (without labels).
We assume that communication is asynchronous and channels are reliable but unordered, that is, at each time-step, only one message is delivered (and, thus, only one node gets activated), no message can be lost, but the reception order is non-deterministic. These assumptions are useful, e.g., to model communication networks where message loss is ignored but order cannot be guaranteed (e.g., because of an underlying UDP transport protocol). Since nodes react only to incoming messages, the communication network has, for each node, a self-loop channel (from the node to itself), which initially contains a special message dedicated to node activation.
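The channel assumptions above can be illustrated with a minimal sketch. It assumes that messages are represented as tuples (relation name followed by arguments) and that non-determinism is simulated by a random choice; the class name UnorderedChannel and its methods are hypothetical and not part of D2C.

```python
import random
from collections import Counter


class UnorderedChannel:
    """Reliable, unordered channel modelled as a multiset of messages.

    Hypothetical helper for illustration only: messages are tuples such as
    ("M", "a"), and delivery order is simulated with a random choice.
    """

    def __init__(self, initial=()):
        # Counter keeps one count per distinct message (multiset semantics).
        self.bag = Counter(initial)

    def send(self, message):
        # Sending never loses a message: the count simply increases.
        self.bag[message] += 1

    def deliver(self, rng=random):
        """Remove and return exactly one pending message, in arbitrary order."""
        if not self.bag:
            raise LookupError("no pending message")
        message = rng.choice(list(self.bag.elements()))
        self.bag[message] -= 1
        if self.bag[message] == 0:
            del self.bag[message]
        return message

    def __len__(self):
        # Size of the multiset, counting duplicates.
        return sum(self.bag.values())
```

Each call to deliver corresponds to one time-step in which a single node is activated; the self-loop channel of a node would start with one activation message already in the bag.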
CDP nodes are exposed to information from the external environment, which represents users and/or external services. Environment interaction is abstracted away by input policies, i.e., rules to provide a new input DB. In this paper, we focus on the b-bounded interactive input policies, where b ∈ ℕ: each time a node receives a message, the current input DB is replaced by a non-deterministically chosen new one whose active domain has cardinality at most b. This policy is relevant to model interaction with external users that continuously provide new information, e.g., text messages for a chat application.
All these information sources are manipulated by a D2C program, i.e., a set of Datalog-like rules specialized to the interactive and distributed setting of CDPs. The specialization is achieved by (1) organizing relation symbols in dedicated signatures (state, input, and transport), (2) using the in-rule flag prev to distinguish queries over the previous state DB and the new one under computation, and (3) labeling transport literals with terms representing senders or recipients (see [11] for a non-deterministic extension of D2C). However, in this paper, we focus on a simple D2C fragment, specialized for single-node networks. In fact, while inconvenient for modelling, single-node CDPs are enough for the technical study of verification of CDPs employing unordered channels, since each such CDP can be encoded over a single-node network.

Single-node CDPs
We now formalize bounded-interactive, single-node CDPs. With a slight abuse, we refer to this fragment simply as CDPs and ignore all other CDP variants (see [11] for the full model). In the head, transport atoms deduce the outgoing messages and state atoms the facts in the new state DB. At the end of the computation, the new state DB replaces the previous one and the outgoing messages are sent on the channel. Transport consistency states that data from the input DB cannot directly flow to the channel. This matches the assumption that only nodes have the power to send messages, which have to be preliminarily gathered in an out-buffer that contributes to the node configuration (state DB, possibly affecting its boundedness, cf. Def. 2.4). Note that state literals in the scope of prev and transport literals in the body are not involved in forming recursive dependencies. While this feature appears to be a major difference from Datalog, it is just a matter of making the syntax convenient for the CDP semantics. In fact, one can provide a Datalog encoding of a set of D2C rules where this difference is ironed out.
Definition 2.3. A D2C program Π over a CDP signature Λ is a finite set of safe and transport consistent D2C rules such that its Datalog encoding is stratified. Given such Λ and Π, a CDP is a tuple (Λ, Π, S₀, m₀), where S₀ is a state DB denoting the initial state and m₀ is an initial message.
The semantics of CDPs is given in terms of configuration graphs, which connect CDP configurations via transitions. Each configuration (S, I, C) describes a snapshot of the system, including the state DB S, the input DB I, and the channel C, represented as a multiset. The configuration is b-input, s-state, or c-channel bounded, for some b, s, c ∈ ℕ, if the cardinality of the active domain of I, of the active domain of S, or of C, is at most b, s, or c, respectively.
Definition 2.4. Given b, s, c ∈ ℕ, we call b-CDP a CDP 𝒟 interpreted under the configuration graph Υ^b_𝒟 consisting of all b-input bounded configurations of 𝒟. A b-CDP is s-state or c-channel bounded if all configurations reachable in Υ^b_𝒟 from an initial configuration are s-state or c-channel bounded, respectively.
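The boundedness conditions on a single configuration can be checked directly, as in the following sketch. It assumes a hypothetical representation in which a DB is a set of facts (relation name followed by arguments) and the channel is a Counter of messages; the parameter names s, b, and c stand for the state, input, and channel bounds.

```python
from collections import Counter


def active_domain(db):
    """Constants occurring in a DB given as a set of facts (relation, arg1, ...)."""
    return {arg for fact in db for arg in fact[1:]}


def is_bounded(state_db, input_db, channel, s, b, c):
    """Check the three per-configuration bounds at once.

    State and input bounds constrain the active domain; the channel bound
    constrains the cardinality of the multiset (duplicates counted).
    """
    return (
        len(active_domain(state_db)) <= s
        and len(active_domain(input_db)) <= b
        and sum(channel.values()) <= c
    )
```

This only inspects one configuration. Whether every reachable configuration satisfies the bounds is the semantic property addressed by Definition 2.4 and is not decided by this check.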

The Verification Problem for CDPs
We study the problem of formal verification of CDPs. Previous work [11] showed that control-state reachability (that is, whether there is an initial configuration from which the target state DB is reachable, ignoring the configuration of the channel) is undecidable even for restricted CDPs that (i) have a single-node network, (ii) use the channel solely to (re)activate the node, and (iii) employ a unary state signature. Decidability can be gained by imposing boundedness conditions on the various CDP data sources [11]. In fact, for state- and channel-bounded CDPs, decidability holds for temporal model checking against formulae in CTL_CDP, a branching-time logic mixing CTL operators to analyze the system evolution, and FOL to query the data sources.
Unfortunately, boundedness is a semantic property, undecidable to check. In addition, while there are different techniques to enforce state boundedness [14,18], the same does not hold for channels. Furthermore, as pointed out in the introduction, imposing boundedness is particularly restrictive for communication channels. Interestingly, while undecidability of control-state reachability over state-unbounded CDPs already holds for unary signatures, in the case of CDPs with bounded states and unbounded channels it has been proved only for binary transport signatures [11]. This makes CDPs that are state- and input-bounded, but operate over unbounded channels carrying unary messages, worth investigating. We call such CDPs unary CDPs (uCDPs).
In the following, we study the problem of model checking variants of uCDPs against selected fragments of CTL_CDP. The base-level fragment we use to express reachability-like properties, called EF(−, neg, st), essentially mixes the CTL temporal operator EF with closed FO formulas over the state signature. Specifically, given a CDP 𝒟 = (Λ, Π, S₀, m₀), where Λ = (𝒮, ℐ, 𝒯), the language EF(−, neg, st) over 𝒟 is defined by grammar rules in which Φ ranges over temporal formulas, φ over closed FO formulas over 𝒮, t₁ and t₂ are terms, and t is a tuple (of proper size) of terms. Such formulas are interpreted as follows: R(t) queries whether the fact R(t) is in the current state DB; ∃ is an existential quantifier over the active domain of the state DB and the support of the multiset channel; and EF is interpreted as in standard CTL, i.e., there exists a path, in the CDP configuration graph, on which φ eventually holds [19]. For example, reachability of a state DB containing the fact R(a) is expressed by the formula EF R(a), while reachability of the state that contains only that fact is expressed by EF(R(a) ∧ ¬∃x.R(x) ∧ ¬x = a).
We study variants of reachability properties starting from EF(−, neg, st) and considering formulae consisting of (positive boolean combinations of) sentences starting with an EF operator, tuning them along three dimensions: the available temporal operators beyond EF, the presence of negations in the FO queries, and which components they inspect (state DB or also channels). We identify such fragments with the notation EF(□, △, ○), where:
• □ indicates which temporal operators can be nested; it can be one of:
  - − (no nesting), as in the grammar above,
  - EF⁺ (nesting of multiple EF), obtained by adding to the grammar the rule Φ ::= EF Φ,
  - AG (nesting of a single AG), obtained by adding to the grammar the rule Φ ::= EF AG φ, or
  - (EF, AX)⁺ (nesting of multiple EF and AX, possibly interleaved), obtained by adding to the grammar the rule Φ ::= EF Φ | AX Φ;
• △ indicates how negation is supported by FO formulas; it can be either:
  - neg, as defined by the grammar above, or
  - pos (no negation), obtained by dropping from the grammar the rule φ ::= ¬φ;
• ○ indicates whether formulas can only query state DBs, or also channels; it can be either:
  - st (queries only over node states), as defined by the grammar above, or
  - st+ch (queries also over the support of channel multisets), obtained by adding to the grammar the rule φ ::= M(t), where M is in the transport signature.
For the formal syntax and semantics of these languages, we refer to the full paper.
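As a reading aid, the following is one grammar consistent with the description of the base fragment and its extensions. It is a hedged sketch, not the rules of the full paper: the relation name R, the variable x, and the exact set of boolean connectives are assumptions.

```latex
% Base fragment (no nesting, negation allowed, state queries only); assumed shape.
\begin{align*}
\Phi    &::= \mathsf{EF}\,\varphi \\
\varphi &::= R(\mathbf{t}) \mid t_1 = t_2 \mid \neg\varphi
             \mid \varphi \land \varphi \mid \exists x.\,\varphi
\end{align*}
% Extensions named in the text:
%   nesting of EF:        add  \Phi ::= \mathsf{EF}\,\Phi
%   single nested AG:     add  \Phi ::= \mathsf{EF}\,\mathsf{AG}\,\varphi
%   EF and AX nesting:    add  \Phi ::= \mathsf{EF}\,\Phi \mid \mathsf{AX}\,\Phi
%   no negation:          drop \varphi ::= \neg\varphi
%   channel queries:      add  \varphi ::= M(\mathbf{t}), with M a transport relation
```

Under this reading, the example property "only the fact R(a) holds" uses both negation and quantification, so it belongs to the fragments that allow negation.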
We study the following model-checking problem variants.
Problem 3.1 (EF[□, △, ○]-MC). Let □ ∈ {−, EF⁺, AG, (EF, AX)⁺}, △ ∈ {neg, pos}, and ○ ∈ {st, st+ch}. The EF[□, △, ○]-MC problem is defined as follows:
Input: Bounds b ∈ ℕ and s ∈ ℕ, an s-state bounded single-node b-uCDP 𝒟, an initial configuration of 𝒟, and a closed formula Φ ∈ EF(□, △, ○).
Output: Whether the configuration graph Υ^b_𝒟 satisfies Φ from the given initial configuration.
Verification w.r.t. all initial configurations reduces to finitely many instances of EF[□, △, ○]-MC. Indeed, due to state and input boundedness, the initial configurations are finitely many up to isomorphisms, and FO formulas are invariant under isomorphisms that fix the constants in them. Establishing the decidability status of the different variants of this problem is challenging, due to the subtle interplay of the CDP components, e.g., how the node state is affected by the content of the multiset channel, whose access is limited by asynchronous communication. To attack this problem, we provide a bridge with models and techniques for the verification of data-aware extensions of PNs, in particular 𝜌-PNs. PNs are one of the most widely studied models for concurrent computations, particularly suited to handle asynchronous threads and message passing. Specifically, 𝜌-PNs lend themselves to be connected to uCDPs. In fact, tokens carrying single data elements match constants used in unary messages, and places match unary relation symbols, so that inserting a token carrying constant c in the place for relation M naturally corresponds to having message M(c) in the channel. What is not at all clear, instead, is how to encode in the 𝜌-PN the infinitely many input and state DBs that may be encountered along a computation. Recall, in fact, that even under state-boundedness, a CDP can encounter infinitely many, genuinely distinct state DBs.
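The correspondence between unary messages and tokens can be sketched as follows. The representation is an assumption made for illustration: the channel is a Counter of pairs (relation, constant), and a marking is a dictionary from place names to Counters of identifiers; the function name channel_to_marking is hypothetical.

```python
from collections import Counter


def channel_to_marking(channel):
    """Turn a multiset of unary messages into a marking.

    Each transport relation becomes a place, and each copy of message
    M(c) becomes one token carrying identifier c in the place for M.
    """
    marking = {}
    for (relation, constant), count in channel.items():
        marking.setdefault(relation, Counter())[constant] += count
    return marking
```

This covers only the channel component of a configuration; state and input DBs need the additional places discussed next.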
To address this issue, we represent state and input DBs up to isomorphism. This can be done by introducing dedicated places for the following purposes: (1) to encode the relation symbols of messages; (2) to represent the isomorphism types of bounded input and state DBs, over a fixed representative bounded domain; (3) to specify a mapping from the representative domain to the infinite domain of data values used to form input and state DBs; (4) to deal with the special constants that are distinguished in the CDP program, ensuring that each one of those forms a singleton isomorphism type. This constitutes the basis for reducing EF[□, △, ○]-MC problems over uCDPs to coverability checks over 𝜌-PNs.
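The idea of representing a bounded DB by its isomorphism type can be illustrated with a brute-force canonical form. This is a sketch under stated assumptions, not the encoding used in the reduction: DBs are sets of facts (relation name followed by string arguments), distinguished constants are passed explicitly and never renamed, and the function name iso_type is hypothetical.

```python
from itertools import permutations


def iso_type(db, constants=frozenset()):
    """Canonical representative of a DB up to renaming of ordinary values.

    Ordinary values are renamed to ("var", i) over a representative domain
    0..k-1; distinguished constants are kept as ("const", name). The
    smallest renamed DB over all renamings is the canonical form.
    Exponential in the active domain size, which is bounded here.
    """
    values = sorted({a for fact in db for a in fact[1:]} - set(constants))
    best = None
    for perm in permutations(range(len(values))):
        rename = {v: ("var", i) for v, i in zip(values, perm)}
        image = tuple(sorted(
            (fact[0],) + tuple(rename.get(a, ("const", a)) for a in fact[1:])
            for fact in db
        ))
        if best is None or image < best:
            best = image
    return best
```

Two DBs are isomorphic, with the distinguished constants fixed, exactly when their canonical forms coincide. Since the active domain is bounded, only finitely many canonical forms exist, which is what makes a finite set of dedicated places sufficient.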
We proceed as follows. We first investigate the decidability status of variants of control-state reachability for 𝜌-PNs (Sec. 4). We then transfer these results to uCDPs, showing reductions from variants of uCDP model checking to 𝜌-PN control-state reachability (Sec. 5).

𝜌-PN Verification
We now introduce the language 𝑃-𝜌CTL to express coverability properties on 𝜌-PNs and study the decidability of the related model-checking problem. 𝑃-𝜌CTL features the CTL temporal operator EF and boolean conjunctions and disjunctions, and replaces propositions with markings, each interpreted on its own up to isomorphisms.
Definition 4.1. Given a set 𝑃 of places, 𝑃-𝜌CTL is the language of formulas ψ defined by the following grammar, where the atomic 𝑃-𝜌CTL formulas are markings m over the place set 𝑃:
ψ ::= m | ψ ∧ ψ | ψ ∨ ψ | EF ψ
The semantics of 𝑃-𝜌CTL is defined as for CTL, with the provision that the current marking m of a 𝜌-PN N satisfies an atomic formula m′ if m covers m′ up to isomorphisms.
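The satisfaction of an atomic formula can be sketched as a brute-force check. The sketch assumes that a marking is a dictionary from place names to Counters of identifiers, and that covering up to isomorphisms means that some injective renaming of the identifiers of the target marking makes it pointwise smaller than the current marking; distinguished identifiers, if any, are not treated specially here. The function name covers_up_to_iso is hypothetical.

```python
from collections import Counter
from itertools import permutations


def covers_up_to_iso(marking, target):
    """Return True if `marking` covers `target` under some injective renaming."""
    ids_m = sorted({a for bag in marking.values() for a, n in bag.items() if n > 0})
    ids_t = sorted({a for bag in target.values() for a, n in bag.items() if n > 0})
    # Try every injective map from target identifiers to marking identifiers.
    for image in permutations(ids_m, len(ids_t)):
        h = dict(zip(ids_t, image))
        if all(
            marking.get(place, Counter())[h[a]] >= n
            for place, bag in target.items()
            for a, n in bag.items()
            if n > 0
        ):
            return True
    return False
```

The enumeration is exponential and only meant to make the definition concrete; it says nothing about how coverability is decided on the infinite-state net.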

Problem 4.2 (𝑃-𝜌CTL-MC). The 𝑃-𝜌CTL-MC problem is defined as follows:
Input: A 𝜌-PN N = (𝑃, T, F), a marking m₀, and a 𝑃-𝜌CTL formula ψ.
Output: Whether N satisfies ψ from m₀, denoted by N, m₀ ⊨ ψ.
Since atomic formulas perform coverability checks, 𝑃-𝜌CTL-MC can be reduced to plain 𝜌-PN coverability. This is done by induction on the structure of the 𝑃-𝜌CTL formula. First, a given formula ψ, to be checked on a 𝜌-PN N and initial marking m₀, is represented as a syntax tree. Its leaves are the occurrences of atomic formulas and the other nodes are obtained by applying to the children the corresponding boolean or temporal operator. Second, from the leaves to the root, each node χ is mapped to a 𝜌-PN N_χ and an initial marking m₀^χ where (1) the net N_χ contains, as sub-nets, the nets N_χ′, for each sub-formula χ′ of χ, (2) N_χ contains the places check_χ and cover_χ, (3) m₀^χ places at least a distinguished identifier on check_χ, and (4) transitions are added so that the place cover_χ can be marked with a distinguished identifier iff N, m₀ ⊨ χ. The construction for non-leaves takes into account the children nets and the semantics of the respective conjunction, disjunction, or EF operator, where the latter case is the most involved one. We refer to the full paper for the details.
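The bottom-up traversal of the syntax tree can be sketched as follows. This only shows the order in which sub-formulas are processed and how control places could be named; it does not build the nets or their transitions, which are given in the full paper. The class names and the naming scheme check_i and cover_i are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str  # stands for a marking used as an atomic formula


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class EF:
    body: "Formula"


Formula = Union[Atom, And, Or, EF]


def bottom_up(formula):
    """Yield the nodes of the syntax tree from the leaves to the root."""
    if isinstance(formula, (And, Or)):
        yield from bottom_up(formula.left)
        yield from bottom_up(formula.right)
    elif isinstance(formula, EF):
        yield from bottom_up(formula.body)
    yield formula


def control_places(formula):
    """Assign one (check, cover) pair of place names to each tree node."""
    return [(f"check_{i}", f"cover_{i}") for i, _ in enumerate(bottom_up(formula))]
```

Processing nodes in this order guarantees that the nets of all children are available when the net for a parent node is assembled.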
From decidability of 𝜌-PN coverability we obtain:
Theorem 4.3. For each finite place set 𝑃, 𝑃-𝜌CTL-MC is decidable.

uCDP Model Checking
To reduce uCDP model checking to 𝜌-PN model checking, we encode an arbitrary s-state bounded b-uCDP 𝒟 = (Λ, Π, S₀, m₀) with Λ = (𝒮, ℐ, 𝒯) into a 𝜌-PN N = (𝑃, T, F).
1. Configuration encoding. We use identifiers to represent the domain of DBs: ID = ∆ ∪ {•}. We use places of N (and related markings) to encode configurations of 𝒟, reorganized in the following way: (i) channel configuration, (ii) extension, up to isomorphisms, of the state and input DBs, over a fixed active domain of representative constants, (iii) mapping, via a partial function, of the representative constants to the represented state and input DB constants.