GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables

This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL's query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies -- anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab -- and show that GenSQL more accurately captures the complexity of the data as compared to common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone as compared to several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.


INTRODUCTION
Building generative models of tabular data is a central focus in Bayesian data analysis [28], probabilistic machine learning [55], and applications such as econometrics [4], healthcare [38], and systems biology [84]. Motivated by these applications, researchers have developed techniques for automatically learning rich probabilistic models of tabular data [1,30,36,50,69]. To fully exploit these models for solving complex tasks, users must be able to easily interleave operations that access both tabular data records and probabilistic models. Example computations include (i) generating synthetic data records that satisfy user constraints; (ii) conditioning distributions specified by probabilistic models given observed data records; and (iii) using database operations to aggregate the results of combined queries against tabular and model data. However, the majority of existing probabilistic programming systems are designed for specifying generative models and estimating parameters given observations. They do not support complex database queries that combine tabular data with generative models specified by probabilistic programs.
GenSQL. This article introduces GenSQL, a novel probabilistic programming system for querying generative models of database tables. GenSQL is structured as a declarative extension to SQL that seamlessly enables queries integrating access to tabular data with operations against a probabilistic model. Examples include predicting new data, detecting anomalies, imputing missing values, cleaning noisy entries, and generating synthetic observations [25,29,46,73]. GenSQL introduces a novel interface and soundness guarantees that decouple user-level specification of high-level queries against probabilistic models from low-level details of probabilistic programming, such as probabilistic modelling, inference algorithm design, and high-performance machine implementations. GenSQL extends SQL with several constructs:
• To complement SELECT clauses that retrieve existing records from a table, GenSQL includes the clause GENERATE UNDER to generate synthetic records from a probabilistic model.
• To complement WHERE clauses that filter data via constraints, GenSQL introduces the clause GIVEN to condition a probabilistic model on an event (i.e., a set of constraints).
• To complement joins between tables, GenSQL introduces a new mixed join clause GENERATIVE JOIN to join each row of a data table with a synthetic row generated from a probabilistic model, whose generation can be conditioned in a per-row fashion on the values of that row.
• To complement arithmetic expressions, GenSQL introduces PROBABILITY OF ... UNDER expressions, which compute the probability (density) of an event under a probabilistic model.
In this work, we assume that an existing probabilistic program synthesis tool has been used to automatically generate a probabilistic model of the user's data satisfying a certain formal interface. The user then uploads the data and model to GenSQL, which automatically integrates them. The user can then issue queries for a variety of tasks, as illustrated in Fig. 1. Although we envision most users using automatically discovered models on their data, the GenSQL implementation also supports hand-implemented or partly-learned probabilistic models. For instance, a user can develop custom models for harmonization across different sources, as shown in Appendix A.3.
The core of GenSQL is formalized as a simply-typed extension of SQL (Section 3.1). This extension includes standard SQL scalar expressions and tables, as well as rowModels (probabilistic models of tables) and events (a set of constructs that allow users to issue probabilistic queries that leverage Bayesian conditioning). Together, rowModels and events enable a seamless integration of standard SQL databases with probabilistic models, including queries that interleave accesses to the database records and the probabilistic models.
The GenSQL query planner (Section 4) lowers queries into plans that execute against a new model interface for probabilistic models of tabular data. This Abstract Model Interface (AMI) (Section 4.1) provides a unifying specification of probabilistic models that are compatible with GenSQL. To implement the AMI, the model must be able to: (i) generate samples from a (potentially approximate) conditional distribution; (ii) compute probability densities at specified points; and (iii) compute probabilities of sets in the support of the conditional distribution.
The open-source GenSQL system includes a number of implementations of the AMI, including:
• a Clojure implementation [61] of Gen [20], a general-purpose probabilistic programming language (see Appendix A.3 for an example);
• models produced by CrossCat [50], a probabilistic program synthesis tool;
• SPPL [73], a probabilistic programming language for exact inference.
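To make the AMI concrete, here is a minimal Python sketch of its three required capabilities, using a toy exact Bernoulli model. The class and method names are illustrative only, not GenSQL's actual Clojure interface; the toy model ignores conditions for simplicity.

```python
import math
import random
from abc import ABC, abstractmethod

class RowModel(ABC):
    """Illustrative sketch of the Abstract Model Interface (AMI)."""

    @abstractmethod
    def simulate(self, condition=None):
        """(i) Sample a row from a (possibly conditioned) distribution."""

    @abstractmethod
    def logpdf(self, row, condition=None):
        """(ii) Log probability density of a row at a specified point."""

    @abstractmethod
    def prob(self, event, condition=None):
        """(iii) Probability of a set in the support of the distribution."""

class BiasedCoin(RowModel):
    """Toy exact model of a one-column table with a Bernoulli(p) column."""

    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)

    def simulate(self, condition=None):
        return {"heads": self.rng.random() < self.p}

    def logpdf(self, row, condition=None):
        return math.log(self.p if row["heads"] else 1 - self.p)

    def prob(self, event, condition=None):
        # Probability of the event {heads = True} or {heads = False}.
        return self.p if event.get("heads") else 1 - self.p
```

Exact backends (such as SPPL) implement these methods with closed-form answers; approximate backends return Monte Carlo estimates, as discussed in Section 4.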
We provide a measure-theoretic denotational semantics for the language (Section 3.2). This semantics captures the interaction between deterministic SQL operations and probabilistic operations on the probabilistic model, enabling us to prove several correctness guarantees that query results satisfy. Specifically, we prove guarantees for (i) the exact case, where exact inference about marginal and conditional distributions of the probabilistic model is available (Theorem 4.2); and (ii) a range of approximate cases, where answers to marginal and conditional queries are obtained via approximate inference algorithms (Theorem 4.3).
We benchmark GenSQL on a set of representative queries, testing the runtime performance, the overhead of the query planner, and the effect of our optimizations. The results show that all queries execute in milliseconds against data tables with up to 10,000 rows, with a speedup in the range 1.7-6.8x against the most closely related baseline, and that the query planner's overhead compared to hand-written code is small. We evaluate our system on two case studies to test its applicability to real-world problems (conditional synthetic data generation for a virtual wet lab and anomaly detection in clinical trials), comparing against a generalized linear model (GLM) and a conditional tabular generative adversarial network (CTGAN [88]) baseline.
Contributions. This paper makes the following contributions:
(1) The GenSQL language (Section 3.1), an extension of SQL with probabilistic models of tabular data as first-class constructs, together with probabilistic constructs that allow the integration of queries on these models with queries on the data.
(2) A unifying abstract interface for models of tabular data (Section 4.1), which bridges the query language and probabilistic models of database tables, and to which all models must conform. The query planner lowers GenSQL queries on models to queries on this interface.
(3) Soundness theorems, which fall into two classes:
• Exact: We show that if models satisfy the exact interface, all deterministic computations will be exact (Theorem 4.2). This theorem relies on an exact denotational semantics (Section 4.3) that precisely characterizes the behavior of exact models.
• Approximate: If approximate models implement consistent estimators (i.e., estimators that converge to the true value), we prove that all queries return consistent results (Theorem 4.3). This theorem relies on a novel denotational semantics that combines measure-theoretic aspects with sequences of random variables.
Together, these guarantees highlight some of the tradeoffs between using an exact model (which delivers stronger guarantees but may be difficult to obtain in some use cases) and an approximate model (which delivers weaker guarantees but is more easily available).
(4) An open-source implementation of GenSQL in Clojure (https://github.com/OpenGen/GenSQL.query), which can be compiled into JavaScript and run natively in the browser.
(5) A performance evaluation of our approach (Section 5), which establishes that GenSQL is competitive with hand-coded implementations and improves performance over a competitive baseline. Two case studies further demonstrate the utility of GenSQL.

EXAMPLE
Figure 2 presents an example GenSQL query. In this example, we work with a probabilistic model (health_model) derived from a national database of patient information, as well as a data table (health_data) from a set of local hospitals. The query uses the probabilistic model to estimate the mutual information -- an information-theoretic measure used in data analysis -- between the age and bmi columns (from the probabilistic model) for specific values of patient weights (selected from the data table). The mutual information is a statistical measure of the strength of the association between these two columns, defined as a sum or integral, over the joint distribution of age and bmi, of the logarithm of the ratio of the joint density to the product of the marginal densities.
The query estimates the mutual information by Monte Carlo integration, i.e., it approximates the integral by sampling. We first generate 1000 copies of each row in the health_data table (line 15) and then use the GenSQL generative join construct (line 16) to complete each row as follows.
For each such row: (1) a synthetic row is sampled from a version of the model conditioned on the weight value of the row; (2) the original row and the synthetic row are concatenated.

The resulting intermediate table is given a name (line 17). Each synthetic row is used as a sample for the Monte Carlo integration of the conditional mutual information for the corresponding weight value. From this intermediate table, we select the weight, age, and bmi columns (line 13). Note that the weight column comes from the patient data, while the age and bmi columns come from the rows sampled from the probabilistic model.
For each weight in the patient data, we compute the Monte Carlo approximation of the mutual information between age and bmi for that weight as the average of log p(age_i, bmi_i) - log(p(age_i) * p(bmi_i)) over the samples for that weight, where (age_i, bmi_i) is a sample from the model conditioned on that weight. To do so, lines 6-11 compute the probability densities p(age_i, bmi_i) (lines 6-7), p(age_i) (lines 8-9), and p(bmi_i) (lines 10-11). For instance, the GIVEN clause conditions the model on the weight column of the model being equal to the weight column of the table (line 11). Line 10 then computes the probability density that the bmi column of the conditioned model is equal to the corresponding column in the table. GenSQL computes these probability densities by invoking the logpdf function in the probabilistic model interface (Section 4.1).
A traditional SQL SELECT statement (line 5) propagates the patient weights and corresponding probabilities pxy, px, and py to generate a table with four columns: the weight and the corresponding probability densities for that weight. Line 3 computes log p(age_i, bmi_i) - log(p(age_i) * p(bmi_i)) for each of the rows, naming this ratio log_pxy_div_px_py. Note that there are 1000 log_pxy_div_px_py values per patient, and hence 1000 * n values for each distinct weight in the local patient data, where n is the number of patients with that specific weight. Finally, line 1 computes the mutual information estimate between age and bmi, for each weight, as the average of the log_pxy_div_px_py values for that weight. This example illustrates the expressivity of GenSQL; we note that our implementation also has a primitive that directly estimates conditional mutual information without materializing intermediate tables.
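The Monte Carlo estimator implemented by this query can be sketched directly in Python. The toy discrete joint distribution below stands in for the conditioned health model at one fixed weight; all names and distributions are illustrative, not taken from the paper's data.

```python
import math
import random

# Toy joint distribution over discretized (age, bmi) bands; in GenSQL these
# densities would come from logpdf calls against the conditioned model.
joint = {("young", "low"): 0.4, ("young", "high"): 0.1,
         ("old", "low"): 0.1, ("old", "high"): 0.4}
p_age = {"young": 0.5, "old": 0.5}
p_bmi = {"low": 0.5, "high": 0.5}

def sample_joint(rng):
    """Inverse-CDF sampling from the toy joint distribution."""
    u, acc = rng.random(), 0.0
    for cell, p in joint.items():
        acc += p
        if u < acc:
            return cell
    return cell  # guard against floating-point round-off

def mi_monte_carlo(n, seed=0):
    """Monte Carlo estimate: average of log p(x,y) - log(p(x) * p(y))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a, b = sample_joint(rng)
        total += math.log(joint[(a, b)]) - math.log(p_age[a] * p_bmi[b])
    return total / n

# Closed-form mutual information, for comparison with the estimate.
exact_mi = sum(p * math.log(p / (p_age[a] * p_bmi[b]))
               for (a, b), p in joint.items())
```

As in the query, the estimate converges to the true mutual information as the number of samples grows.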

Language
The core calculus extending SQL for querying probabilistic models of tabular data is given in Fig. 3, and the type system is given in Fig. 4. As SQL is a subset of GenSQL, this calculus also includes a simply-typed formalization of SQL where terms are typed in a pair of contexts: a local one and a global one. We found this formalization interesting in its own right, as we could not find an equivalent formalization in the programming languages literature.
There are several noteworthy differences with variables and contexts from traditional simply-typed languages based on the lambda-calculus, which are explained below. We note at the outset that we distinguish two types of conditioning, through constructs called events and events-0. Events-0 address a technical difficulty well-known in the PPL and measure theory literature [78,86]: conditioning on a continuous variable taking a specific value creates a possible event of probability 0, and requires special treatment.
Names and Identifiers. We assume a countable set C of names col ∈ C for the columns of tables and rowModels, as well as a countable set I of identifiers id ∈ I for naming tables and rowModels. Some constructs rebind a table or rowModel to a new identifier, thereby changing the identifier in its type and the way its columns are accessed in SELECT and event clauses.
Event and event-0 expressions. Events are Boolean expressions on tables and rowModels, which include equality on discrete values but not on continuous values; the latter is reserved for events-0. The only probability-0 events expressible as events are those impossible under any model, e.g. id.col > 6 ∧ id.col < 3, and these do not require separate treatment. Events and events-0 are used in the PROBABILITY OF and GIVEN constructs.
Fig. 4. Type system of GenSQL.
PROBABILITY OF takes an event (or event-0) expression and a rowModel to query, and returns the probability (or probability density) of the event under the model.
RowModel expressions. GIVEN takes a rowModel and an event (or event-0) expression, and returns a new rowModel: the conditional distribution of the original rowModel on the event. The event expression can be given by a list of inequalities on arbitrary variables and equalities on discrete variables, in which case GIVEN acts as a set of constraints on the possible returned values of the model. Otherwise, the event expression can be a set of equalities on possibly continuous values, and is understood as conditioning the model on the given values.
Contexts. Expressions are typed in a pair of contexts Γ; Δ containing table and rowModel types. As these types include identifiers, there is no need for the more classical notation x : t pairing a variable with its type. Γ is a set of types, while Δ is an ordered list of types. In the premise of a typing rule such as PROBABILITY OF, only the last element of Δ is accessible to an expression. We denote the empty context by []. Intuitively, Γ contains the ambient tables in the database schema and any loaded models, and within Γ, all identifiers id are assumed distinct. Δ is the value environment, and contains only tables that are "in scope" for a particular expression. Scalar, event, and event-0 expressions all depend on the value environment. If an expression has a table in scope, it is iterated over the rows of that table and can only access the current row. If a PROBABILITY OF expression has a rowModel in scope, it queries the model for the probability of an event under that model. If it is a GIVEN expression, it conditions that model on an event.
Typing rules. Judgments are of the form Γ; Δ ⊢ e : t, where e is an expression (a scalar, table, rowModel, event, or event-0 expression in Figure 3), t is a type (a scalar, table, event, or rowModel type in Figure 3), and Γ; Δ is a context. Given some loaded tables and rowModels forming an environment Γ, the objects of interest are "closed expressions" of table type, i.e., expressions of the form Γ; [] ⊢ e : [?id]{cols}. "Closed" here refers to Δ being empty, not Γ. Notable rules include those that require the same identifier twice, such as the PROBABILITY OF or the WHERE rule. For instance, in the WHERE rule, where the table being filtered has identifier id, a valid SQL filter expression would be id.col = 3, where col is a column of the table. This reflects the fact that the filter expression has access to the identifier id in its local environment, and that the rows of the table are iterated over by the expression.
Notations used in the type system. ?id indicates an optional identifier. "id′ fresh" means that id′ does not occur in the contexts Γ, Δ or in the type of a subterm of the expression. We often abbreviate {col_1 : t_1, ..., col_n : t_n} as {cols}. We write cols ∩ cols′ = ∅ when the sets of column names in cols and cols′ must be disjoint. In the first typing rule for events, we write ∀i. (t_i, op_i) ≠ (Real, =) to mean that no op_i can be an equality at a continuous type. We also recursively define two macros, vars and condvars, used below (their definitions are elided here).
Restrictions imposed by the type system. If the same identifier id appears twice in the premise of a typing rule, the two occurrences must be equal, and two different identifiers id and id′ must be distinct. The JOIN and GENERATIVE JOIN operations require that the columns of the two tables have disjoint names. As explained above, events may not contain equalities at continuous types. A model can only be conditioned once on an event-0, which is enforced by the restriction on id GIVEN e_0. Events-0 follow a linear typing discipline to avoid contradictory statements such as id.col = 1.0 ∧ id.col = 2.0. Events-0 in a PROBABILITY OF query on a conditioned model cannot refer to the conditioned columns of the model, which is enforced by the restriction vars(e_0) ∩ condvars(m) = ∅.
Syntactic sugar. Our implementation includes various syntactic sugars that are not present in the formalization but are used in several figures. Given tables of types [id]{col_1 : t_1, ..., col_n : t_n} and [id′]{col′_1 : t′_1, ..., col′_m : t′_m}, we have a set of equivalences (elided here) in which the col_i are columns that do not appear in the SELECT clause.
• * EXCEPT id.col removes the column col from the list of columns that * selects.

Semantics
We define a denotational semantics using measure theory, shown in Fig. 5. Even though the SQL subset of GenSQL is not probabilistic, our probabilistic semantics ensures compositional reasoning about the semantics of SQL queries combined with probabilistic GenSQL expressions, such as synthetic tables generated by rowModels. As usual, the semantics of expressions is defined compositionally on typing judgment derivations, and ⟦e⟧ is a shorthand for ⟦Γ; Δ ⊢ e : t⟧.
Base types (Fig. 5c). We assign to each base type t a measure space ⟦t⟧ := (X_t, Σ_t, μ_t) consisting of a set X_t, a sigma-algebra Σ_t, and a reference measure μ_t. Z denotes the set of integers, N the natural numbers, and B the Booleans, all equipped with the discrete sigma-algebra. We equip the reals R with the Borel sigma-algebra. We interpret Null by adding a fresh element ★ to the standard interpretation of each base type, equipped with the discrete sigma-algebra. The semantics of a base type is then given by the smallest sigma-algebra making {★} measurable while ensuring that every previously measurable set remains measurable. (This construction is also called the "direct-sum sigma-algebra" [24, 214L].)
The base measure on discrete types such as Int, Nat, Bool, Str is the counting measure. On continuous types such as Real, the base measure is the Lebesgue measure λ. These are extended to base measures on ⟦t⟧ by using the Dirac measure δ_★ on {★}; e.g., the base measure on ⟦Real⟧ is λ_R + δ_★. We write ⊗ for the product of measures, and extend the reference measures to the product space ∏_{1≤i≤n} ⟦t_i⟧ by taking the product of the reference measures μ := ⊗_{1≤i≤n} μ_{t_i}.
Table types (Fig. 5a). Our semantics has two modes of interpreting table types: a "tuple mode" Tup⟦−⟧ and a "table mode" Tab⟦−⟧. Tab⟦−⟧ interprets tables as measures on bags of tuples, while Tup⟦−⟧ interprets a table as a tuple, representing the current row of the table being processed by a scalar, event, or event-0 expression. More precisely, we denote by P(X) the measurable space of probability measures on the standard Borel space X [32]. The table semantics interprets table types as measures on bags of tuples, Tab⟦[?id]{cols}⟧ = Bag(Tup⟦[?id]{cols}⟧), where Bag(X) = { f : X → N | f(x) = 0 except for finitely many x }. Bag(X) is equipped with the least sigma-algebra containing the generating sets { b ∈ Bag(X) | b contains exactly n elements in U } for measurable sets U of X [21].
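The bag-of-tuples construction can be made concrete with Python's Counter, which represents exactly the finitely-supported maps X → N (an illustrative sketch, not part of the formal semantics):

```python
from collections import Counter

# A bag over X is a map X -> N that is zero almost everywhere;
# a Counter over hashable tuples is a faithful finite representation.
def bag(*rows):
    return Counter(rows)

def bag_union(b1, b2):
    # Multiset union adds multiplicities, matching bag addition.
    return b1 + b2

# Two copies of the same row are distinct elements of the bag.
t = bag(("alice", 34), ("bob", 51), ("alice", 34))
```

Note that, unlike a set, the bag records that the row ("alice", 34) occurs twice, which is essential for SQL's multiset semantics of SELECT.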
Contexts (Fig. 5b). We interpret the global context Γ with the table semantics Tab⟦−⟧ and the local context Δ with the tuple semantics Tup⟦−⟧. We write γ for an element of Tab⟦Γ⟧, and view it as a finite map from identifiers to values. Likewise, we write δ for an element of Tup⟦Δ⟧. We write δ[id ↦ v] for the extended finite map that additionally maps id to v.
Event expressions (Fig. 5g). An event expression ⟦e_1 : E_1{cols}⟧(γ, δ) is interpreted as a measurable subset S of ⟦cols⟧ (a disjoint union of hyper-rectangles [73]). Depending on the expression, this set is used in different ways. We interpret the probability clause PROBABILITY OF e_1 UNDER m as ∫_{⟦cols⟧} 1_S dμ, where μ is the measure denoting the model m; i.e., the set S is used via its indicator function 1_S.
When used in a GIVEN clause, we constrain the model to the set denoted by the event, and then renormalize.
If the event has probability 0, we instead return a row of Null. A similar situation to WHERE Null arises for GIVEN, e.g. in GIVEN id.col op Null. Following the principle of least surprise, Null acts by convention as a unit for conditioning, i.e. id GIVEN id.col op Null behaves the same as id. To ensure this, we interpret Boolean expressions op differently in the semantics of events, and write an extended version of op which sends ★ to true. The denotation of id.col op Null will therefore be the entire space, and conditioning a model on this event will not change its denotation.
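The convention that Null is a unit for conditioning can be sketched in a few lines of Python; NULL, extended_op, and the operator names are illustrative, not part of the implementation:

```python
# Sentinel standing in for the Null value (the fresh element ★).
NULL = object()

def extended_op(op):
    """Lift a Boolean operator so that any comparison against Null is true.

    The denotation of `id.col op Null` is then the entire space, so
    conditioning on it leaves the model unchanged (Null acts as a unit).
    """
    def op_bar(x, y):
        if x is NULL or y is NULL:
            return True
        return op(x, y)
    return op_bar

eq_bar = extended_op(lambda a, b: a == b)
```

With this lifting, filtering or conditioning on `col = Null` keeps every row/point, matching the principle of least surprise described above.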

Fig. 5. Denotational semantics of GenSQL.
We consider measures on spaces with chosen disintegrations and (marginal) densities w.r.t. the reference measure. More precisely, we interpret a rowModel id from the global context Γ as a quadruple Tab⟦id⟧ := (μ, p, {κ_C}_C, {p_C}_C). Here, μ is a measure denoting the unconditioned model, and p a density of μ w.r.t. the reference measure. For each valid decomposition (C, C′) of the columns of id, the kernel κ_C is a (C, C′)-disintegration of μ. For all values c of the columns C, p_C(− | c) is a density for κ_C(c) w.r.t. the reference measure. If c is a partial assignment of the variables in C, we also write p_C(x | c) for the marginal density of κ_C(c) at x obtained from p_C(− | c) by integrating out the missing variables. We denote by P_adm(X) the set of such quadruples (μ, p, {κ_C}_C, {p_C}_C), where μ is a measure on X. Given m ∈ P_adm(X), we write m.meas for its first component μ, m.pdf for the density p, m.κ_C for the kernel κ_C, and m.κ_C.pdf for the density p_C. Using this notation, given an event-0 e_0 denoting a projection π and a value v, the expression m.π.pdf(v) gives a marginal density of the model at v; i.e., m.π.pdf(v) is a version of the density of π_*(m.meas) evaluated at v. We assume that all the models in the context are admissible, which is enforced in the semantics of contexts.
The models used in queries are built from admissible models and carry chosen densities, which is enforced in the semantics of rowModel expressions. We write P_dens(X) for the set of pairs (μ, p) where μ is a measure on X := X_1 × ... × X_n and p is either a density of μ w.r.t. the reference measure, or of the form x ↦ q(x, v_1, ..., v_k) / r(v_1, ..., v_k) for some values v_1, ..., v_k, where r is a marginal density of μ on X_1 × ... × X_k w.r.t. the reference measure. The second case is used to represent the density of a model conditioned on an event-0 expression.
For conditioning on events, given m ∈ P_dens(X) and a measurable S ⊆ X, we define the conditioned model cond(m, 1_S) by restricting m to S and renormalizing.
Fig. 6. A selected subset of the syntax and type system of the lowered language.
Lowering Language (Fig. 6). The lowering language is a first-order simply-typed lambda calculus with second-order operations acting on bags, and primitives for the AMI. It also contains a version of events and events-0 that can be used by AMI primitives. Operations like map, filter, and exp have their usual meaning, and their typing rules, along with those for constants, tuples, projections, and arithmetic operations, are standard and recalled in Appendix D (Fig. 20). join takes two bags of tuples and returns their Cartesian product. replicate n evaluates its bag argument n times and returns the union of all the resulting bags. mapreduce takes a bag of tuples and a function f from tuples to bags, and returns the union of all the bags obtained by applying f to each tuple in the input bag.
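A minimal Python model of these bag primitives, using Counter for bags, may clarify their meaning; this is an illustrative sketch, not the actual lowered-language implementation:

```python
from collections import Counter

def bag_map(f, b):
    """Apply f to every element, preserving multiplicities."""
    out = Counter()
    for x, n in b.items():
        out[f(x)] += n
    return out

def bag_filter(p, b):
    """Keep only the elements satisfying the predicate p."""
    return Counter({x: n for x, n in b.items() if p(x)})

def bag_join(b1, b2):
    """Cartesian product of two bags of tuples."""
    out = Counter()
    for x, n in b1.items():
        for y, m in b2.items():
            out[x + y] += n * m
    return out

def bag_replicate(k, thunk):
    """Evaluate the bag-producing thunk k times and union the results."""
    out = Counter()
    for _ in range(k):
        out += thunk()
    return out

def bag_mapreduce(f, b):
    """Apply the bag-valued function f to each tuple; union the results."""
    out = Counter()
    for x, n in b.items():
        for _ in range(n):
            out += f(x)
    return out
```

Note that replicate takes a thunk rather than a bag value: when the bag expression is probabilistic (e.g. draws from simulate), each of the k evaluations must re-run it.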
Lowering program transform (Fig. 7). After obtaining a normal-form query, the planner applies a program transformation T{•} from normalized GenSQL queries to the lowered language, defined by pattern matching on the structure of the query. It carries a local context ρ of variables (a finite map from identifiers to variable names) which are bound in the surrounding program. Similarly to the local context Δ in GenSQL, ρ starts empty ([]) at the root of the syntax tree. It is used to rename variables in the lowered query. The rationale is that a table identifier id in Δ is transformed into a variable representing a tuple being iterated over by a map or fold primitive. A rowModel identifier id, on the other hand, is uniquely accessible and identified from the global context Γ, thanks to the normalization procedure, which ensures that no rowModel is renamed in the normalized query. A simple proof by induction shows that the transformation preserves typing.

Lowering Guarantees for Exact Backend
A large class of models supports exact inference, e.g. those expressible in SPPL [73] and truncated multivariate Gaussians. These models satisfy the exact AMI: they return exact samples from simulate, and compute exact marginal logpdf and prob queries, even for conditioned models. We make this precise by giving a measure semantics for the lowered language (Fig. 21) and showing that the program transform T{•} preserves the semantics of the lowered query (Theorem 4.2).
In particular, all the scalar computations in the query are deterministic, and the generated synthetic data comes from exact conditional distributions. The denotational semantics (Appendix D, Fig. 21) of the lowered language is mostly standard and resembles the measure-theoretic semantics of GenSQL given in Fig. 5. Terms Γ ⊢ t : τ are interpreted as deterministic measurable functions ⟦Γ⟧_exact → ⟦τ⟧_exact. Terms Γ ⊢ t : Bag[τ] are interpreted as probability kernels ⟦Γ⟧_exact → PBag(⟦τ⟧_exact), where substitution for these programs is interpreted using the Kleisli composition for the point process monad [21]. By induction on the structure of GenSQL programs in context Γ; Δ, we can show the following (proof in Appendix D.3):

Approximate Backend Guarantee
By relying on approximate probabilistic inference, general-purpose PPLs can express large classes of models in which exact inference is intractable. In addition, programmable inference [51] ensures PPLs can support a diverse class of probabilistic models without sacrificing inference quality. We give a new denotational semantics for the lowered language that is appropriate for reasoning in scenarios where the rowModels are implemented in PPLs with approximate Monte Carlo inference. Monte Carlo algorithms are typically parameterized by a positive integer N specifying a compute budget, such as the number of particles in a sequential Monte Carlo (SMC) algorithm [18] or the number of samples in a Markov chain Monte Carlo (MCMC) algorithm [64]. The algorithm specifies a sequence of distributions or estimators that converge in some sense to a quantity of interest as N → ∞. In the case of approximate sampling algorithms, most typically the distribution of the generated samples converges weakly to the target distribution; in the case of parameter estimation, the algorithm produces a strongly consistent estimator of the target parameter [18,64].
Random variable semantics. Our denotational semantics for approximate AMI implementations is motivated by the above discussion. We assume the existence of an ambient probability space (Ω, F, P) and associate with each term a sequence of random variables approximating the term in the exact semantics. As an example, the N-th approximation of ⟦map (λx. t_1) t_2⟧_approx, in a context γ and at a "random seed" ω ∈ Ω, is obtained as follows: we first obtain the N-th approximation of the input t_2, which is a measure on tables, and evaluate it at ω to obtain a concrete table; we then apply the N-th approximation of the function λx. t_1 to each row of that table. The full semantics is given in Appendix D.4, Fig. 23. We assume the following hold:
• For each rowModel identifier id : M[(t_1, ..., t_n)] in the environment γ, event e_1 : E_1[(t_1, ..., t_n)], and event-0 e_0 : E_0[(t_1, ..., t_n)], there exists a sequence of probability measures {μ^N_{id; ⟦e_0⟧_approx(γ), ⟦e_1⟧_approx(γ)}}_N on Bag(∏_{i=1}^n ⟦t_i⟧);
• for id, γ, e_1, and e_0 as above, and an event e_1′ : E_1[(t_1, ..., t_n)], there exists a sequence of real random variables {P^N_{id; ⟦e_0⟧_approx(γ), ⟦e_1⟧_approx(γ), ⟦e_1′⟧_approx(γ)}}_N which takes values in [0, 1] P-almost surely;
• for id, γ, e_1, and e_0 as above, and an event-0 e_0′ : E_0[(t_1, ..., t_n)], there exists a sequence of real random variables {p^N_{id; ⟦e_0⟧_approx(γ), ⟦e_1⟧_approx(γ), ⟦e_0′⟧_approx(γ)}}_N.
These random sequences represent the sequences of approximations produced by the implementation of the AMI. In general, for a given term t, the convergence of the sequences associated with its sub-terms does not imply that the sequence associated with t converges. For instance, consider evaluating, in an appropriate context (γ, δ), a query that filters a table id with a WHERE clause comparing id.col against the value of PROBABILITY OF id′.col′ = 7. If the value of the term PROBABILITY OF id′.col′ = 7 is approximated, even if we can make this approximation arbitrarily accurate, the output of the query need not converge. For example, if the table id contains a row in which the value of col is exactly ⟦PROBABILITY OF id′.col′ = 7⟧(γ, δ), but the approximation converges to the true value from below, this row will not be included in the query result no matter the accuracy of the approximation. Intuitively, this arises from the fact that the indicator functions of half-intervals are not continuous.
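The non-convergence phenomenon can be reproduced numerically. In the sketch below, a hypothetical estimator converges to the true probability 0.5 from below, and a row whose col equals that limit is excluded at every approximation level, even though the exact semantics includes it. The estimator and query are illustrative stand-ins, not GenSQL code:

```python
def prob_estimate(n):
    """Hypothetical consistent estimator converging to 0.5 from below."""
    return 0.5 - 1.0 / n

def query_result(rows, n):
    """N-th approximation of: keep rows where col <= estimated probability."""
    return [r for r in rows if r["col"] <= prob_estimate(n)]

# A row whose col equals the limiting probability exactly.
rows = [{"col": 0.5}]
```

No matter how large n grows, the approximated query returns the empty table, while the exact query (using the limit value 0.5) returns the row: the indicator of the half-interval is discontinuous exactly at this boundary point.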
In order for the lowered queries to denote asymptotically sound estimators for the original queries, we require that the implementations of the AMI methods be asymptotically sound, and write γ_lim to denote an evaluation of the context in which each random variable is replaced by its limit as N → ∞. In Appendix D.4, we formalize the notions of safe queries and asymptotically sound AMI implementations, and give details of the proofs. We then give the following guarantee.

EVALUATION
We evaluate the performance of an open-source Clojure implementation of GenSQL against other systems with similar capabilities. We test runtime, the effect of optimizations, and the runtime overhead of our system over alternative implementations of the same task. Experiments were run on an Amazon EC2 C6a instance with Ubuntu 22.04, 4 vCPUs, and 8.0 GiB RAM.
The probabilistic models used in the evaluation are obtained using probabilistic program synthesis [74, Chapter 3]. Each model is an ensemble of "MultiMixture" probabilistic programs [69, Section 6], which are posterior samples from the CrossCat model class [50], generated using ClojureCat [16]. An ensemble of 10 probabilistic programs is used in Section 5.1 and 12 programs in Section 5.2.

Performance and Usability
Runtime comparison. Table 1 compares the runtime on 10 benchmark queries (Appendix F) adapted from Charchut [16, Tables 4.2 and 4.3] using GenSQL (with the ClojureCat backend) and BayesDB (with the CGPM backend [66]) for evaluating exact probability densities. Section 5 compares the runtime and standard deviation for computing the probabilities of positive-measure events. GenSQL (with the SPPL backend [73]) delivers exact solutions, whereas BayesDB delivers approximate solutions using rejection sampling. Two rejection strategies in BayesDB are shown in Section 5: a fixed number of samples (faster but higher variance) and a fixed acceptance rate (slower but lower variance), both of which are inferior to the exact solutions from GenSQL.
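The trade-off between the two rejection strategies can be sketched as follows. This is a minimal illustration, not BayesDB's implementation: `sample`, `condition`, and `event` are hypothetical callables standing in for the model's row sampler and the query's GIVEN and PROBABILITY OF clauses.

```python
import random

def estimate_fixed_samples(sample, event, condition, n=10_000, rng=random):
    """Rejection estimate of P(event | condition) with a fixed sample budget.

    Fast, but the number of accepted samples -- and hence the variance of
    the estimate -- depends on how rare the condition is.
    """
    accepted = kept = 0
    for _ in range(n):
        row = sample(rng)
        if condition(row):
            accepted += 1
            kept += event(row)
    return kept / accepted if accepted else float("nan")

def estimate_fixed_acceptances(sample, event, condition, n_accept=1_000, rng=random):
    """Rejection estimate that keeps drawing until n_accept samples satisfy
    the condition: slower for rare conditions, but with controlled variance."""
    accepted = kept = 0
    while accepted < n_accept:
        row = sample(rng)
        if condition(row):
            accepted += 1
            kept += event(row)
    return kept / n_accept
```

With a fixed sample budget, a rare condition leaves few accepted samples and a high-variance estimate; fixing the number of acceptances bounds the variance at the cost of an unbounded number of model samples. An exact backend such as SPPL avoids this trade-off entirely.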
The performance gains in GenSQL are due to three main reasons: the ClojureCat backend is faster than the CGPM backend in BayesDB; GenSQL has optimizations (discussed below) that exploit repeated computations; and GenSQL itself is implemented in Clojure, a performant language.

Optimizations and system overhead. GenSQL leverages two classes of optimizations: caching (of the likelihood queries and conditioned models) and exploiting independence relations between columns. The latter allows us to simplify a query such as PROBABILITY OF id.col₁ > 42 UNDER id GIVEN id.col₂ = 17 to the semantically equivalent query PROBABILITY OF id.col₁ > 42 UNDER id if the columns col₁ and col₂ are independent. Appendix B gives a detailed account of the optimizations.
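The independence optimization can be sketched as a rewrite over a query's GIVEN clause. This is an illustrative sketch, not GenSQL's planner: `independent` is a hypothetical oracle that, for example, a CrossCat-style backend could answer by checking whether two columns are assigned to the same view.

```python
def simplify_givens(targets, givens, independent):
    """Drop each conditioning column that is independent of every target.

    targets: columns appearing in the PROBABILITY OF clause
    givens: mapping from conditioning column to its observed value
    independent(a, b): True if columns a and b are independent under the model
    """
    return {
        col: val
        for col, val in givens.items()
        # Keep the GIVEN clause only if it can influence some target.
        if not all(independent(col, t) for t in targets)
    }
```

For instance, with columns partitioned into views as in a CrossCat sample, a GIVEN clause on a column in a different view from the target is dropped, while a same-view clause is kept:

```python
views = {"col1": 0, "col2": 1, "col3": 0}
independent = lambda a, b: views[a] != views[b]
simplify_givens(["col1"], {"col2": 17, "col3": 5}, independent)
# col2 is dropped (independent of col1); col3 is kept
```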
In Fig. 9, the unoptimized GenSQL queries have a 1.1-1.6x overhead compared to the pure ClojureCat baseline. The optimizations reduce the overhead and can sometimes drastically improve performance, while caching significantly reduces the variance in the runtime of the queries. In Fig. 9b, the effect of the independence optimization varies between replicates, as these are different CrossCat model samples, which explains the higher variance in query runtime.

Code comparison. Figure 10 compares the code required in GenSQL, pure Python using SPPL [73], and pure Clojure using ClojureCat [16], for a conditional probability query. Figure 10a shows how GenSQL gains clarity by specializing in data that comes from database tables. In contrast, both SPPL and ClojureCat require users to hand-write the looping/mapping over the data, which is error-prone. For instance, the code in Fig. 10c will crash if the table has missing values. In Fig. 10b, ClojureCat requires conditions to be maps. Users can decide whether they encode columns with strings, symbols, or keywords. If this choice does not align with the key type returned by the CSV reader, the query will run, but conditioning will be a no-op.
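The kind of hand-written per-row looping that GenSQL automates can be sketched as follows, including the missing-value handling whose omission makes code like Fig. 10c crash. The `model.logpdf(targets, constraints)` interface is an assumed simplification for illustration, not SPPL's exact API.

```python
def row_logpdfs(model, rows, target_col):
    """Hand-written loop computing, for each row, the conditional log-density
    of target_col given the row's other values -- the computation that a
    GenSQL PROBABILITY OF ... GIVEN query expresses in one line."""
    out = []
    for row in rows:
        if row.get(target_col) is None:
            # Propagate missing targets instead of crashing on them.
            out.append(None)
            continue
        # Missing values must also be excluded from the conditioning set.
        constraints = {c: v for c, v in row.items()
                       if c != target_col and v is not None}
        out.append(model.logpdf({target_col: row[target_col]}, constraints))
    return out
```

Every such loop is an opportunity for the errors described above (crashes on missing values, silently mismatched key types), which is why a declarative query that generates this code is less error-prone.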
In Appendix A.1, we compare a single-line query on a conditioned model in GenSQL to the equivalent code in Scikit-learn [60] on the Iris data from the UCI ML repository. The model querying alone in Scikit-learn is more than 50 lines long and clearly error-prone, and we find that GenSQL offers a significant advantage in simplicity over such baselines.

(a) Varying number of conditions and targets in the PROBABILITY OF queries shown in Table 1.
Code comparison with BayesDB. Figure 11 shows GenSQL and its closest relative, BayesDB [52], on a GENERATIVE JOIN query on synthetic data. The GenSQL code is more concise and simpler than BayesDB's code, which is possible due to the language abstractions for manipulating models. In BayesDB, the user must exit to SQL and hand-code column manipulations to fit the fixed pattern expected when querying a model. Section 6 provides a detailed comparison of GenSQL and BayesDB.

Case Studies on Real World Data
We present two case studies to demonstrate the application of GenSQL to real-world problems: one in medicine (clinical trial data) and one in synthetic biology (wet lab data). In both domains, datasets are costly to obtain, and researchers are interested in understanding and analyzing the data they have. In the first case, we show how anomaly detection in GenSQL can be used to check for probable mislabelling of the data; the anomalous rows can also be investigated further to understand the reasons for the anomaly. In the second case, we show how GenSQL can be used to generate accurate synthetic data, capturing the complex relationships between different host genes and experimental conditions. Capturing these relationships with the model helps predict whether a certain experimental condition or modification of the genome has cascading downstream effects through the interrelations between the genes. Such effects can render the cell toxic and kill the bacterium, leading to a failed experiment. The virtual wet lab allows researchers to check for such effects before running costly experiments in the real world.

Anomaly detection in clinical trials. The BEAT19 clinical trial [87] contains data about COVID-19 and records behavior, environment variables, and treatments. Figure 12a shows a query used for anomaly detection [15]. For each row, it computes the model likelihood of the BMI value given the other values of the row. The trial participants labeled anomalous (Fig. 12b) all report above-average or well-above-average health and that they exercise, while meeting the World Health Organization's definition of clinical obesity [83]. Fig. 12d compares the overall population in the trial (grey), anomalous individuals (red), and the subset of the population with the same behavioral covariates (exercise, health status, etc.)
as the anomalous individuals (black). For similar individuals, the data generally suggests a lower BMI. We can also compare the marginal and the conditional probability of BMI values in the table of clinical trial records (Fig. 12d). Anomalous data (red) lies below the diagonal line, highlighting the "contextualization" of BMI values that happens by conditioning the models: the BMI values are less likely given the context of the other values in the row, while not necessarily extreme. To demonstrate this effect, we first apply a WHERE filter that removes BMI values outside of the 5th and the 95th percentile, excluding one-dimensional extreme values (Fig. 12a). We then compute the conditional probabilities of the BMI values in each row for the remaining data and return anomalies. Fig. 12c shows the posterior predictive over the ensemble of models (line) and for each individual model (dots) for a BMI above 30 given exercise and health status.

Conditional synthetic data generation for a virtual wet lab. Figure 13 shows synthetic gene expression data generated using GenSQL, given a dataset from an experiment testing genetic circuits [56] in Escherichia coli. The synthetic data aligns with the overall population characteristics (Figs. 13a and 13b) and accurately reflects the outcomes of actual experimental interventions (Figs. 13c to 13f). In synthetic biology, the prospect of implementing genetic circuits has fundamental implications for medical device engineering [53], bio-sensing [82], and environmental biotechnology [89]. These circuits require input, which is typically provided by adding inducer substances to the culture mediums where the organisms are grown. Our figures show the effect of adding two such inducer substances, arabinose and IPTG, on six different host genes.
Producing standard RNA sequencing data can be costly [48], especially for new, engineered organisms that are not mass-produced. When it is produced, RNA sequencing yields measurements of gene expression for thousands of annotated host genes [9]. These genes are highly interrelated, and knowledge of the relations is only partially available [44]. Thus, the application of generative models to these data presents a challenging high-dimensional modeling problem, further compounded by the inherent non-linearity in the data, as illustrated in Figs. 14a and 14b.
The most popular approach to modeling gene expression data is linear regression [22], as the models are easy to interpret and readily available in data analysis libraries. For non-numerical data, linear regression requires analysis-specific re-coding of discrete values. That aside, the low capacity of the model means that it fails to faithfully reproduce the actual wet lab data, as shown in Fig. 14a. Conditional tabular generative adversarial networks (CTGAN) [88], though more complex, are also an appropriate baseline because they are domain-general and effective at modeling multivariate, heterogeneous data. However, GANs are hard to interpret, and because RNA sequencing data acquisition is so costly, the number of available training examples (943) is too small for CTGAN to perform well. Fig. 14b depicts this model class failing to accurately model the gene expression data.

RELATED WORK
Probabilistic databases. Probabilistic database systems [79, 81] develop efficient algorithms for inference queries on discrete distributions over databases, often based on variants of weighted model counting, for which hardness results have been shown and algorithms developed for tractable cases and efficient approximations. Cambronero et al. [13] integrate probabilities into a relational database system to support imputation, while Hilprecht et al. [39] use probabilistic circuits to improve query performance. Jampani et al. [42] use probabilistic databases to support random data generation and simulation. Cai et al. [12] provide Gibbs sampling support in the space of database tables to a SQL-like language, enabling Bayesian machine learning workloads such as linear regression or latent Dirichlet allocation. These languages are typically extensions to SQL or relational algebra, but with limited support for probabilistic models, which they trade off for performance. Schaechtle et al. [76] present a preliminary design for extending SQL to support probabilistic models of tabular data. Our work differs in that it presents (1) a formalization of the system; (2) a denotational semantics; (3) soundness guarantees for the system; (4) a unified interface that probabilistic models implement; (5) a lowering transform and target lowering language; (6) an extensive performance evaluation; and (7) two new case studies on real-world data.
Semantics of probabilistic databases. Bárány et al. [3] and Grohe et al. [34] give a semantic account of probabilistic databases by providing a probabilistic semantics and guarantees for an extension of Datalog. Dash and Staton [21] give a monadic account and denotational semantics for measurable queries in probabilistic databases; their semantics of SQL-like expressions inspired the semantics of our table expressions. Grohe and Lindner [35] established a formal framework for reasoning about infinite probabilistic databases. Benzaken and Contejean [5] formalized the semantics of SQL in Coq, while Borya [11] formalized relational algebra and a SQL-like syntax using a model checker.
Probabilistic program synthesis. GenSQL has been designed to leverage powerful probabilistic program synthesis techniques based on Bayesian [1, 50, 69] or non-Bayesian [17, 30, 36, 41, 57] probabilistic model discovery. The AMI provides a unifying approach to expressing powerful Bayesian inference workflows in these probabilistic programs using a high-level SQL-like language. Extending the interface to handle synthesized models of time series [70, 72] and/or relational data [71] is a promising avenue for future work.
Probabilistic programming systems. While we used a Clojure version [16] of CrossCat [50] in our experiments, our system supports any probabilistic program that satisfies the rowModel interface. We can thus reuse models written in the variety of PPLs developed in the literature, such as languages supporting approximate inference [8, 14, 20, 26, 54, 75, 85] and exact inference [27, 40, 73, 90]. Our model interface is inspired by the SPPL interface [73] and the CGPM interface [68]. Gordon et al. [33] propose a probabilistic programming system using a functional syntax similar to the stochastic lambda calculus, specialized to inference over relational databases and implemented on top of Infer.NET. It can perform inference tasks such as linear regression and querying for missing values, which enable data imputation, classification, or clustering. Borgström et al. [10] present a probabilistic DSL and semantics for regression formulas in the style of the formula DSL in R. Domain-specific PPLs for tabular data have also been designed to solve tasks such as data cleaning [46, 63].
BayesDB. Although BayesDB [52] was motivated by goals similar to GenSQL's, GenSQL introduces novel semantic concepts and soundness theorems that BayesDB did not. GenSQL also improves upon BayesDB in terms of expressiveness and performance, as shown in Section 5. For example, GenSQL queries can be nested and interleaved with SQL, and can combine results from multiple models. GenSQL also provides an exact inference engine for a broad class of sum-product probabilistic programs [73]. BayesDB, on the other hand, has interesting features that GenSQL does not yet support, such as iterating over models and columns (e.g., to find pairs of columns with the highest mutual information) [67] and similarity search between rows [65]. BayesDB also has a "meta-modeling" DSL [66] for composing probabilistic programs from various sources.

Automated machine learning. Several systems [6, 23, 43, 45, 58, 80] have been developed to automate the use of discriminative machine learning methods for analyzing tabular data. Unlike GenSQL, they do not support the use of generative probabilistic programs for tabular data satisfying a unified interface (for sampling, conditioning, and evaluating probabilities or densities), which enables a single model to be reused across many different tasks.

CONTRIBUTIONS
GenSQL specializes probabilistic programming languages to applications with tabular data. It is differentiated from general-purpose PPLs in three main ways:

• Through the AMI, GenSQL enables multi-language workflows. Users from different domains and with different expertise should be able to use probabilistic models in their queries without having to learn all the details of the PPL in which the model is written. The AMI enables this separation of concerns by providing a well-specified interface. It enables integrating probabilistic models of tabular data written in different languages, as it can be implemented in either a general-purpose or domain-specific PPL (Appendix C). There is no standard way to jointly query models in different PPLs or to use the result of a query in one language against a model in another language. As different PPLs focus on different workloads, users of GenSQL can work with several models written in different PPLs. GenSQL thus provides a natural multi-language workflow, and our experiments already use multiple backends (Gen.clj, SPPL, and ClojureCat).
• GenSQL enables declarative querying. No current PPL offers a simple declarative syntax for evaluating complex queries (e.g., containing elaborate joins and nested selects) that interleave calls on probabilistic models and database tables. A number of PPLs provide declarative syntax for specifying and conditioning models, but the user must decide which operations to evaluate on which conditional distributions and then manually combine the results of these operations. GenSQL relieves the users of such concerns, reducing the chance of programming errors.
• GenSQL enables reusable performance optimizations. Widely used database management systems (DBMS) have been optimized by many engineer-hours of effort over several decades. These optimizations are highly reusable because they are independent of the application domain and of the specific languages that the DBMS interfaces with. GenSQL enables analogous optimizations for workloads that interleave ordinary database queries with probabilistic inference and generative modeling. GenSQL's optimizations can carry over to many domains and workflows, avoiding the need for project-specific performance optimizations involving probabilistic models of tabular data.
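One reusable optimization, caching conditioned models (Section 5), can be sketched as follows. This is a minimal illustration under an assumed `model.condition(constraints)` interface; the class and method names are hypothetical, not GenSQL's actual implementation.

```python
class CachingBackend:
    """Memoize model conditioning so that repeated queries with identical
    GIVEN constraints reuse the same conditioned model object."""

    def __init__(self, model):
        self.model = model
        self._cache = {}

    def condition(self, constraints):
        # Constraints are keyed as a sorted tuple so that equivalent GIVEN
        # clauses written in different orders share one cache entry.
        key = tuple(sorted(constraints.items()))
        if key not in self._cache:
            self._cache[key] = self.model.condition(constraints)
        return self._cache[key]
```

Because the rewrite operates at the level of the AMI, the same cache serves any backend implementing the interface, which is what makes the optimization reusable across models and workflows.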
We see two interesting avenues for GenSQL to impact database applications and design.
Integration of GenSQL with database management systems (DBMS). First, GenSQL could serve as a query language, allowing users to query generative models of tabular data directly from the DBMS. One use case of rapidly increasing practical importance is querying synthetic data, generated on the fly to meet user-specified privacy-utility trade-offs, instead of querying real data that cannot be shared due to privacy constraints. Other potential applications for synthetic data include testing, performance tuning, and sensitivity analysis of end-to-end data analysis workflows. In all these settings, GenSQL implementations could also draw on performance engineering innovations from DBMS engines, optimized further using the generative models themselves (e.g., to reduce variance for stratified sampling approximations to SQL aggregates [2]).
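The last point can be illustrated with a model-assisted stratified estimate of a SQL aggregate such as AVG. This is a hedged sketch: the stratum weights are assumed to come from the generative model, and all names are hypothetical.

```python
def stratified_mean(strata, n_per_stratum, rng):
    """Stratified estimate of a population mean.

    strata: list of (weight, sampler) pairs, where weight is the stratum's
    proportion of the population (here assumed to be estimated by the
    generative model) and sampler draws values from the stratum's
    conditional distribution.
    """
    total = 0.0
    for weight, sampler in strata:
        draws = [sampler(rng) for _ in range(n_per_stratum)]
        # Weight each stratum's sample mean by its model-estimated proportion.
        total += weight * sum(draws) / n_per_stratum
    return total
```

Relative to uniform sampling, allocating samples per stratum and weighting by model-estimated proportions reduces the variance contributed by rare but high-influence strata.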
Modularized development of queries and models. GenSQL introduces abstractions that isolate query developers and query users from model developers. This separation of concerns is analogous to the physical data independence property achieved by relational databases [19]. Most database users do not need to know the details of how data is stored and indexed in order to query it efficiently, but some experts do understand how to tune indices to ensure that databases meet the necessary performance constraints. Likewise, most GenSQL users need not be experts on the details of the algorithms, modeling assumptions, and software pipelines that produced the underlying generative models. Expert statisticians and generative modelers can still ensure the models are of sufficient quality and tune trade-offs between performance, maintenance costs, and accuracy, improving models without invalidating user workflows. With GenSQL, both typical users and experts can more easily and interactively query generative models to test their validity, both qualitatively and quantitatively. This division of responsibility between users, generative modelers, and probabilistic programming system developers could help our society more safely and productively broaden the deployment of generative models for tabular data.

DATA-AVAILABILITY STATEMENT
An artifact providing a version of GenSQL, and reproducing our experiments, is available [77].

(a) Type System for Table Expressions

³ Our implementation is less restricted. It allows join variants such as SQL's left join, where the tables do not have disjoint columns. It also allows multiple conditionings on the same model, which are then normalized to the form above. See Appendix D.1 for details about the normalization.
Proc. ACM Program. Lang., Vol. 8, No. PLDI, Article 179. Publication date: June 2024.

Theorem 4.3 (Consistent AMI Guarantee). Let Γ, [] ⊢ t : T[?id]{cols} be a safe query and suppose the AMI methods have asymptotically sound implementations. Then, for every evaluation of the context Γ, the limit of the approximating sequence P-almost surely equals the denotation of the original query.

Varying number of conditions in the GIVEN clause for GENERATE UNDER queries (caching does not apply).
Varying number of rows in a data table used in the FROM clause of a SELECT with a PROBABILITY OF query.
(e) Compare conditional and marginal BMI.

Fig. 13. Case study: conditional synthetic data generation for a virtual wet lab.
Real and generated data from bivariate linear models (compare to Fig. 13b).
Comparison of real and generated data from CTGAN (compare to Fig. 13b).

Fig. 14. Linear models and conditional generative adversarial networks (CTGAN [88]) produce less accurate synthetic virtual wet lab data than the synthetic data from GenSQL shown in Fig. 13b. In (b), the default model and inference parameters of the open-source implementation of CTGAN are used.
⃗e AS ⃗col ≡ e₁ AS col₁, . . ., eₙ AS colₙ.

Base types. Cells of tables can have a base type τ, which is either a continuous type or a discrete type. Continuous types are Real for reals, PosReal for non-negative reals, or Ranged(a, b) for reals in the range [a, b]. Discrete types are Nat for natural numbers, Int for integers, Str for strings, Cat(n₁, . . ., nₖ) for a categorical type over k attributes, and Bool for Booleans.

Primitive operations. The set Op includes operations such as = on integers and reals and ∧, ∨ on Booleans, as well as constants for every value of a base type. For any op ∈ Op, we write op : τ₁, . . ., τₙ → τ if op has arity n, takes arguments of base types τ₁, . . ., τₙ, and returns a value of type τ. In particular, operations with no arguments are constants of the appropriate type, such as true and false at the Boolean type. All base types have an additional constant Null representing a missing value. This constant is preserved by primitive operations (e.g., Null + 3 ↦→ Null, Null * 4.1 ↦→ Null). By convention, WHERE Null clauses act as WHERE false.

Table and rowModel types. We denote these types by 𝜅[?id]{col₁ : τ₁, . . ., colₙ : τₙ}, where the sort 𝜅 marks whether the type describes a table or a rowModel. ?id is an optional identifier, allowing access to columns of a table or rowModel in a query. For instance, in SELECT id.weight, the identifier id refers to a table and weight to a column of that table. The identifier is optional because, e.g., there is no default identifier for a table created by a join. The notation {col₁ : τ₁, . . ., colₙ : τₙ} indicates that the table has columns colᵢ of type τᵢ. Therefore, we can think of each row of a table as an element of the record type {col₁ : τ₁, . . ., colₙ : τₙ}, of a bag of rows as a table, and of a rowModel as a row generator.

Table expressions. Apart from typical SQL operations, we have two ways to generate synthetic data. GENERATE UNDER returns a synthetic table with a number of rows specified by the LIMIT clause, where each row is generated by sampling from a given rowModel. GENERATIVE JOIN takes a rowModel and a table and returns a synthetic table with the same number of rows, where each row is generated by concatenating the current row of the table with a sample from the rowModel. The model generating the samples can be conditioned on the current row of the table. RENAME renames a table or rowModel with a new identifier.

Table and rowModel Types
Tup⟦𝜅[?id]{col₁ : τ₁, . . ., colₙ : τₙ}⟧ = ⟦τ₁⟧ × · · · × ⟦τₙ⟧
⟦M[?id]{col₁ : τ₁, . . ., colₙ : τₙ}⟧ = P_dens(⟦τ₁⟧ × · · · × ⟦τₙ⟧)
Tab⟦T[?id]{col₁ : τ₁, . . ., colₙ : τₙ}⟧ = PBag(⟦τ₁⟧ × · · · × ⟦τₙ⟧)
Tab⟦M[?id]{col₁ : τ₁, . . ., colₙ : τₙ}⟧ = P_adm(⟦τ₁⟧ × · · · × ⟦τₙ⟧)
(b) Semantics of Contexts