Inference of Probabilistic Programs with Moment-Matching Gaussian Mixtures

Computing the posterior distribution of a probabilistic program is a hard task for which no one-fit-for-all solution exists. We propose Gaussian Semantics, which approximates the exact probabilistic semantics of a bounded program by means of Gaussian mixtures. It is parametrized by a map that associates each program location with the moment order to be matched in the approximation. We provide two main contributions. The first is a universal approximation theorem stating that, under mild conditions, Gaussian Semantics can approximate the exact semantics arbitrarily closely. The second is an approximation that matches up to second-order moments analytically in face of the generally difficult problem of matching moments of Gaussian mixtures with arbitrary moment order. We test our second-order Gaussian approximation (SOGA) on a number of case studies from the literature. We show that it can provide accurate estimates in models not supported by other approximation methods or when exact symbolic techniques fail because of complex expressions or non-simplified integrals. On two notable classes of problems, namely collaborative filtering and programs involving mixtures of continuous and discrete distributions, we show that SOGA significantly outperforms alternative techniques in terms of accuracy and computational time.


INTRODUCTION
Probabilistic programming languages are programming languages augmented with primitives expressing probabilistic behaviours [Gordon et al. 2014].Examples are random assignments ("program variable  is distributed according to the probability distribution "), probabilistic choices ("do  1 with probability  else  2 ) or conditioning ("variable  is distributed according to , under the constraint that it can only take positive values").This has enabled a variety of applications such as the analysis of randomized algorithms, machine learning and biology [Gordon et al. 2014].
Given a probabilistic program, there are different equivalent ways in which its semantics can be defined [Kozen 1983].Following Kozen's Semantics 2 [Kozen 1979], in this paper we see a program as a transformer: given an initial joint distribution over the program variables, each instruction A similar behavior is exhibited by applying Pyro's variable elimination [Obermeyer et al. 2019], which computes the exact posterior in 0.192 s for  = 1 and about 9 s for  = 100 (see Section 7.3).

Proposed Approach
The difficulty in performing inference on the previous program stems from various factors: PSI's exact engine returns non-simplified integrals, requiring computationally expensive numerical integration.STAN's MCMC, Pyro's VI and AQUA's quantization cannot be applied in this case, but in general, can incur long computational times and out-of-memory errors (see Section 7).BLOG's and Pyro's ad hoc sampling perform best, but increasing the number of steps hinders scalability.
To complement all these techniques, we present a new approximate analytical method that does not require integration or sampling and that relies on a compact representation of the joint distribution using moment-matching Gaussian mixtures (GMs).Our choice of representation is based on some desirable properties of GMs, and in particular the following three: i) they can encode both continuous and discrete distributions (using degenerate GMs); ii) their moments can be computed exactly and efficiently; iii) they are universal approximators, so we can always increase the number of components in our representation to get a better approximation.These considerations lead to the definition of a family of approximating semantics called Gaussian Semantics.
More in detail, we define Gaussian Semantics so that it is closed with respect to the class of (degenerate) GMs, meaning that, at every program location, the Gaussian Semantics of a program transforms a GM into a GM.In particular, we proceed as in the general approximation scheme proposed by Boyen and Koller [1998]: given a GM , the exact semantics of a program location would transform it in a different distribution , which is not necessarily a GM.However, we approximate  with a new GM   and define the Gaussian Semantics as semantics that transform  into   at that program location.This process is represented in Figure 1.Performing this at every program location approximates the whole program semantics.In particular, we choose to approximate  with   using moment-matching, meaning that   is a GM having the same moments of  up to a certain order  .This is convenient for two reasons: first, it avoids computing the full pdf of , as only its first  moments are needed to find   ; second, since  is obtained as a transformation of a GM, it can be expressed as a linear combination of transformed Gaussians, and its moments can be computed analytically using the results summarized in Table 1.
To sum up, in a Gaussian Semantics each program location is associated with an integer  , and the semantics acts on a GM  performing two steps: first, it computes the first  order moments of the transformed distribution , using the results in Table 1; then, it finds a new GM   having same moments as  up to order  .More than one moment-matching GM   can exist, therefore, Table 1.Summary of the theoretical results used to compute the moments of transformed Gaussian Mixtures.N (, Σ) denotes a Gaussian distribution with mean  and covariance matrix Σ,  is any real constant and ,  are vectors in R  defining the hyper-rectangle [, ] = { ∈ R  :   ≤   ≤   }.
In general, the posterior computed by SOGA is a GM whose number of components grows exponentially in the number of conditional statements.To help cope with this, we introduce a pruning strategy that keeps the number of components in the GMs below a user-specified threshold by merging components with minimal cost.Using a prototype implementation, we compare SOGA on a corpus of benchmarks against state-of-the-art tools representative of different inference methods: MCMC sampling (STAN), symbolic execution (PSI), quantization (AQUA), VI (Pyro).Even when it is not the best-performing method, it still provides the flexibility to model both continuous and discrete posteriors, unlike STAN, Pyro and AQUA, which only support the former.Additionally, it enables reaching numerical solutions in reasonable runtimes when PSI returns non-simplified integrals that demand computationally prohibitive times for numerical integration.When applied to the analyzed benchmarks, pruning significantly reduced the computational time without incurring noticeable approximation errors.
Importantly, we highlight that SOGA is particularly useful for performing inference on two classes of programs: those involving mixtures of continuous and discrete distributions and collaborative filtering models.Most state-of-the-art approaches do not support the first class, even though it is known that this kind of distribution arises in various application domains [Gao et al. 2017;Kharchenko et al. 2014;Pierson and Yau 2015].Thanks to its GM representation, SOGA can easily encode these distributions.When tested on benchmarks introduced specifically for this problem, SOGA is able to perform inference faster than dedicated methods such as Wu et al. [2018], while identifying the exact posterior.Collaborative filtering models are an established framework to model recommendation systems and have been extensively investigated in the machine learning community [Koren et al. 2021].SOGA can deal with a large number of variables without incurring large computational times or out-of-memory errors, as happens with alternative methods.
Contributions.In summary, the contribution of this paper is threefold: i) From the theoretical point of view, we define a family of approximating semantics called Gaussian Semantics and prove that they approximate the exact semantics of a bounded probabilistic program arbitrarily well.ii) From the practical perspective, we present an implementation of a particular instance of Gaussian Semantics, called SOGA, and evaluate it against other state-of-the-art implementations of alternative techniques (PSI [Gehr et al. 2016], STAN [Carpenter et al. 2017], AQUA [Huang et al. 2021], Pyro [Bingham et al. 2019]) on a set of benchmarks taken from the literature.While not always the best performer, SOGA can handle models with discrete posteriors while STAN, AQUA and Pyro only support continuous ones.On the other hand, SOGA can provide accurate and computationally tractable approximations when symbolic analysis by PSI may fail due to complex formulas or non-simplified integrals that cause high computational costs for their numerical integration.iii) We focus on two classes of models taken from the machine learning literature -collaborative filtering and inference involving mixtures of continuous and discrete distributions -where SOGA clearly outperforms the other methods, complementing the current state-of-the-art.
Paper Structure.Notation and background notions are presented in Section 2. Control-flow syntax and exact semantics are introduced in Section 3. Gaussian Semantics is introduced in Section 4, while the universal approximation theorem is presented in Section 5. We present SOGA in Section 6 and evaluate it in Section 7. We cover further related works in Section 8, while conclusions and future works are drawn in Section 9.

BACKGROUND
We now introduce some of the notation and the concepts that will be used in the rest of the paper (we refer the reader to the Supplementary Material for additional background material).
Probability Distributions.We always deal with distributions over R  and use  ∼  to denote that the stochastic vector  is distributed according to distribution .We always assume that a distribution  can be specified by its probability density function (pdf)   : R  → R ≥0 .For  ∼  and a set  ⊆ R  , the probability of  under , denoted by   (), can be expressed as the Lebesgue integral   () = ∫    ().Sometimes, we will find it more convenient to refer to the probability measure induced by  on the measurable space (R  , B (R  )), where B (R  ) is the Borel -algebra on R  .By probability measure, we mean a function  : B (R  ) → [0, 1] that satisfies the following two properties: i) (∅) = 0 and (R  ) = 1; ii) (∪  ∈N   ) =  ∈N (  ) for any countable collection of disjoint sets   ⊂ R  .For a distribution , the associated measure Moreover, due to the presence of conditional branches and observe statements in a probabilistic program, we consider distributions conditioned to subsets of R  .Letting I  be the characteristic function of a set  ⊂ R  such that   () > 0,  | will denote the distribution of  truncated (or conditioned) to , whose pdf is given by   | = 1   ()   I  .Observe that   | is obtained by setting   to 0 outside , and then, by dividing it by   (), so that the induced measure is still a probability measure.Given a -dimensional random vector  ∼  and a subvector  ′ = (  1 , . . .,    ) with  1 , . . .,   ∈ {1, . . .,  }, we denote by Marg  ′ () the marginal distribution of , obtained integrating out the components not in  ′ , i.e.  ′ ∼ Marg  ′ () = ∫ R  −   () ( \  ′ ).Gaussian Distributions and Mixtures.Gaussian distributions with mean  and covariance matrix Σ are denoted by N (, Σ).We assume that Σ can be singular, which corresponds to Gaussian distributions having support in a subspace of R  , as shown in Figure 2. In particular, when Σ is the null matrix we consider the associated random variable to be a Dirac delta centered in , referred to as   .Gaussian distributions enjoy many useful properties; some that we will use are listed in Table 1 while we refer to [Bishop and Nasrabadi 2006, Chapter 2.3] for a detailed treatment.
We refer to mixtures as the scalar products of two vectors ( 1 , . . .,   ) and ( 1 , . . .,   ) such that  =1   = 1, 0 <   ≤ 1, and   is the distribution of the -th component, with  = 1, . . ., .The numbers   are called mixing coefficients.We will denote a mixture as  =  1  1 + . . .+     , thus indicating that  has pdf   () =  1   1 () + . . .+      ().When  = 1, we recover the case of a single distribution.A special case is given by Gaussian Mixtures (GMs) in which   = N (  , Σ  ) for  = 1, . . ., , with mean vectors and covariance matrices   , Σ  .We assume (  , Σ  ) ≠ (  , Σ  ) for  ≠  .The set of GMs is dense in the set of probability distributions with respect to the weak topology [Lo 1972], meaning that for any probability distribution one can always find a GM that approximates it arbitrarily closely with respect to a particular metric, the Levy-Prokhorov distance.Since we consider Dirac deltas as particular Gaussian distributions, discrete distributions over a finite set of values are included in the set of GMs.
Distributions Determined by Their Moments.Let  ∼ .For  = ( 1 , . . .,   ), with   ∈ N, ∀  define Letting  vary over all vectors in N  such that  1 + . . .+   =  we obtain the set of all  -th order central moments of .Observe that, for any , E[ 0 ] = 1.Since the construction of our semantics relies on the Method of Moments, we need to assure that this converges to the correct distribution.This is true only if no other distribution has all moments equal to those of the target one [Billingsley 2008].We say that in this case, the target distribution is determined by its moments, formalized next.
Definition 2.1.A distribution  is determined by its moments, if for any other distribution  ′ such that for all  1 , . . .,   ≥ 0 ∫

Syntax
Following Kozen [1979], we will consider probabilistic programs as transformers over distributions  defined over a vector of variables taking values in R  .Similarly to Chaudhuri and Solar-Lezama [2011], we represent programs in a control flow-graph (cfg) syntax [Cousot and Cousot 1977].We use as explanatory example the simple program in Algorithm 1.
A program is a directed graph  = ( , ) where  is set of nodes and  is the set of edges.Specifically, we consider directed acyclic graphs (DAGs) of bounded depth.Each node belongs to one of five types in Γ = {entry, state, test, observe, exit}.We denote the fact that a node  ∈  is of a given type  ∈ Γ with  : .The nodes satisfy the following properties.
• A node  :  has no incoming edge and one outgoing edge.Moreover, for each program, there is exactly one  ∈  such that  :  and one  ∈  such that  : , and they correspond to the root and the only leaf of the DAG representing the program, respectively.The control-flow syntax for Algorithm 1 is represented in Figure 3.  where x is an output variable, i.e., a variable on which to compute the posterior distribution, and g denotes a fresh read-only variable distributed according to univariate GMs with mixing coefficients   , means   and variances  2  ,  = 1, . . ., .For the sake of brevity, in our examples, we will also use read-only variables denoted by gauss(, ) which is syntactic sugar for gm( [1], [], [𝜎]).We use read-only variables to perform random assignments, as it is done in lines 2, 4, and 6 of Algorithm 1 and to encode Boolean conditions depending on arbitrary distributions.
State nodes are labelled by either skip or assignment instructions of the type x i = E(z), where E(z) is an expression of the following form: where c, c 1 , . . ., c n are scalar.
Test and observe nodes are labelled by linear Boolean conditions (LBCs) of the following form: where c, c 1 , . . ., c n are scalar constants, ⊲⊳ ∈ {<, ≤, ≥, >} and □ ∈ {==, !=}.We associate an LBC with the set, defined on the space of augmented variables, where ⟦true⟧ = R  and ⟦false⟧ = ∅.Observe that an expression or LBC can have at most  output variables but any finite number of read-only variables.

Supported Programs
Our syntax rules out general distributions depending on non-constant parameters, unbounded loops, and non-polynomial functions.We briefly comment on the limitations of this approach, how they can be mitigated, and when they are shared by other techniques.
Probabilistic Assignments.Probabilistic assignments are performed by assigning univariate readonly variables to output variables.This is not a limitation since dependence between variables can be encoded using multiple assignments.For what concerns the restriction on GM distributions, instead, we exploit the already discussed density of GMs in the space of distribution [Lo 1972], and assign a GM arbitrarily close to the target distribution.In this sense, we will assume that we are approximating non-GM distributions with a GM whenever we refer to non-GM distributions.
Finally, probabilistic assignments will involve only distributions depending on constant parameters.This restriction is more difficult to overcome and is shared with other tools based on moment-based techniques, such as Bartocci et al. [2020] and Moosbrugger et al. [2022].This is because it is not always possible to derive how the moments change if one or more parameters of a distribution are probabilistic.As in Moosbrugger et al. [2022], this limitation can be mitigated by performing suitable reparametrizations (see Supplementary Material).
Iterations.We restrict our attention to loops bounded by deterministic constants (as in our illustrating example in Algorithm 1), similarly to Gehr et al. [2016], Huang et al. [2021], Holtzen et al.
[2020], Albarghouthi et al. [2017] and Nori et al. [2014].If guarantees on almost sure termination can be given, the true distribution of the loop could be approximated by a bounded unrolled loop with a sufficiently large number of iterations [Kozen 1979].
Polynomial Programs.Differently from Carpenter et al. [2017]; Gehr et al. [2016]; Huang et al. [2021], we consider programs involving only the arithmetic operations +, −, * , and ^.This assumption is common to other approaches relying on moment-based techniques such as Bartocci et al. [2020] and Moosbrugger et al. [2022], due to the fact that non-polynomial functions (such as the logarithm) may generate distributions that are not determined by their parameters.We remark that from expressions such as (1) and (2), general polynomial expressions and Boolean conditions can be obtained, respectively, by chaining state nodes and nesting conditional statements.A probabilistic choice, i.e.,  is assigned  1 with probability  or  2 with probability 1 − , is encoded using the LBC  <  where  is a standard Gaussian and  is the Gaussian -quantile.

Exact Probabilistic Semantics
The "exact" semantics follows Kozen's Semantics 2 [Kozen 1979].Since we are using the control-flow syntax of Cousot and Cousot [1977], we are close to the collecting semantics in Chaudhuri and Solar-Lezama [2011]: we combine the semantics of the nodes to define the semantics of the paths, then define the semantics of the program as a sum over the semantics of the paths.This semantics is particularly convenient for our method because it gives the posterior distribution as a mixture, similarly to Zhou et al. [2020].
To overcome conditioning with respect to zero-probability events we assume that whenever else.
• For an observe node  labelled by B(z) we condition the current distribution to ⟦B(z)⟧.
Observe that if B(z) only contains read-only variables, conditioning does not affect the distribution of the output variables .
In all other cases conditioning is treated as usual: • The exit node  takes as input (, ) and outputs the same pair (, ): The semantics of the program  is then defined as: (4) Example 3.1.For the program in Algorithm 1 we have only two paths,   =  0  1  2  3  5  6  7 and   =  0  1  2  4  5  6  7 corresponding to evaluations of the conditional statement as true or false, respectively.To compute the semantics of   we start from  0 , that outputs (1,  0 ).This pair is taken as input by  1 , which is a state node assigning  (1, 0) to  1 , so its output is (1, N (0, Σ 1 )) with Σ 1 = diag(0, 1) (corresponding to the distribution in Figure 2b).This pair is taken as input by the test node  2 .Since we are considering   , for which    ( 2 ) =  3 , and  ( 3 ) = , the semantics of  2 in this path conditions N (0, Σ 1 ) to  1 > 0. Therefore in this path the output of  2 is (0.5, N (0, Σ 1 )|⟦ 1 > 0⟧), where 0.5 =  N (0,Σ 1 ) (⟦ 1 > 0⟧).This pair is taken as input by  3 , which updates the distribution of  2 and therefore outputs a new pair (0.5,  3 ).We can proceed until we compute the output of  7 , which gives the final pair ⟦  ⟧ = (  ,   ).In the same way, we compute ⟦  ⟧ = (  ,   ), and finally, the semantics of the whole program as the mixture

GAUSSIAN SEMANTICS
Gaussian Semantics is a family of semantics closed with respect to GMs.Each node takes as input and returns a GM over the program variables.This is done by composing the exact semantics of a node with an operator    acting on the output distribution of ⟦⟧  .In particular,    transforms any distribution  into a GM , having the same moments of  up to order  .Therefore, we call    the moment-matching operator.Formally, we use a map  :  → N 0 to associate each node  ∈  with the highest order of moments that will be matched at .The semantics of a node is then: where I is the identity acting on the first element of the pair (, ), and    () is the momentmatching operator.The Gaussian Semantics of paths and programs are defined similarly to the exact semantics as: Example 4.1.We have seen in Algorithm 1 that the exact semantics is not closed with respect to GMs.For example, node  2 takes as input a Gaussian but returns a truncated Gaussian, which is not a GM.If, instead, we consider Gaussian Semantics with ( 2 ) = 3 the output of  2 will be a GM matching the first three order moments of N (0, Σ 1 ) | ⟦ > 0⟧.Two steps are required to compute ⟦ 2 ⟧   .In the first step, we compute the first ( 2 ) = 3 order moments of the output distribution of ⟦ 2 ⟧  .Therefore, we compute the first order moments Observe that, thanks to the results in Table 1, this is significantly easier than computing the whole pdf of the output distribution.The second step involves finding a GM having the computed moments.This is generally more complex and is performed by the operator    .In the rest of the section, we will assume that the computation of the moments is done using the aforementioned formulas, and we focus on the definition of the operator    and the derivation of its properties.
Moment-Matching Operator.In general, the operator    () acts on distributions  =  =1     that are mixtures (possibly of a single component).The moments of  are computed as a linear combination of the moments of its components: if . Therefore, when computing the moments of , we first compute the moments of every component   .Then, it makes sense to define    () so that when it acts on a mixture  ( > 1), it recursively acts on each component of the mixture, moment-matching each of them.When, instead,    acts on a non-mixture distribution  ( = 1), it returns a GM having moments up to order  equal to those of .This second action is encoded by a second operator match  .
We require that match  () satisfies the following two conditions: R1) for any distribution , match  () is a GM; R2) match  () has central moments up to order  equal to those of .
The existence of the operator match  is guaranteed by the following result, derived from Schmüdgen [2017, Theorem 17.2], stating that that, for any finite sequence of moments, there exists a momentmatching discrete distribution putting positive mass on a number of points smaller than or equal to the number of matched moments.Since discrete distributions are GMs, the Proposition holds.
For detailed proof, see the Supplementary Material.
Example 4.3.In our example, we want to match a total of 10 moments, a zeroth-order moment (which is always 1), two first-order, three second-order, and four third-order moments.Theorem 17.2 in Schmüdgen [2017] ensures that there exists at least one discrete distribution (therefore a GM) with  ≤ 10 components that has exactly the given moments.Proposition 4.2 ensures the existence of at least one GM matching the moments of  up to order  .In general, letting   () denote the set of all finite GMs matching the moments of  up to order  , we may have that |  ()| > 1.For match  to be well-defined, we need to uniquely identify a moment-matching GM in   ().This can be done using different heuristics: we propose one based on the principle of maximum entropy, which we call the max entropy matching (MEM).
Max Entropy Matching.MEM can be summed up as follows: if |  ()| > 1 we choose the GM  having the least number of components (in order to minimize the number of parameters to be fit) and minimizing a certain cost function.Any remaining tie is resolved by comparing the vectors of parameters  that identify the GMs with respect to lexicographic ordering (we give an ordering on the parameters of GMs in the Supplementary Material).We select our cost function as the sum of the opposite of the differential entropy plus a penalty term, where the differential entropy for a distribution  with pdf   is defined as [Cover 1999] Intuitively, the principle of maximum entropy asserts that the distribution maximizing entropy is the one that minimizes the number of assumptions on the distribution [Cover 1999].Therefore, maximizing  () we are choosing the most general moment-matching distribution.We add to  () a penalty term to avoid uncontrolled growth of the parameter values.Then, the procedure to compute match  () is the following.
The following proposition guarantees that MEM leaves us with a well-defined operator match  .It is again proved using Theorem 17.2 from Schmüdgen [2017], and by noticing that P is a compact set, therefore Eq. ( 8) is well-defined.Again, we defer a detailed proof to the Supplementary Material while we explain how MEM works using an example.Proposition 4.4.For any  and  the max entropy matching uniquely identifies match  ().
We remark that the choice of MEM is arbitrary, as other cost functions could be introduced.However it has various benefits.(i) To guarantee that Proposition 4.4 holds, one needs a bounded cost function.(ii) Using entropy leads to a parallelism with VI: SOGA itself can be seen as a form of VI since it involves the minimization of the reverse differential entropy [Kullback and Leibler 1951].However, correspondence with VI is lost when higher-order moments are considered, because the minimizer of the reverse differential entropy is not analytically expressible for GMs.(iii) In the spirit of minimizing the number of assumptions made on the approximating distribution, the approach looks more pleasing mathematically.
Example 4.5.While Schmüdgen [2017] ensures that we can find a moment-matching GM with 10 components, it is easy to check that  * = 2 is the minimum number of components required to match three order moments.In fact,  * > 1, since for a single Gaussian, given the mean and the covariance matrix, all the other moments are fixed (so, we can match the first two order moments but not the third).For  = 2 instead we can consider the GM  with pdf N (, ) + (1 −)N ( ′ ,  ′ ) such that ,  ′ , ,  ′ satisfy the following system: 1,2 ≥ 0 In the system we equate the moments of  (l.h.s) with those of  ∼  (r.h.s, computed in Example 4.1).Moreover, we look for solutions such that 0 <  < 1 and ,  ′ are positive semidefinite (last line).Since the system is polynomial, using SMT solvers over reals we can check it is satisfiable; therefore,  * = 2. Now we should determine the set P of all solutions and find those that minimize the cost function.Since finding all solutions is generally impossible, we directly proceed to optimize our cost function numerically, constraining the variables to satisfy the previous system.We find the approximate solution  = 0.572,  = (0.471, 0),  ′ = (1.236,0) and  = diag(0.066,0),  ′ = diag(0.426,0).The approximating GM is shown in the green line of Figure 4, while the blue line shows the true non-Gaussian distribution.
The example shows that the difficult step in computing a Gaussian Semantics of arbitrary order is 2).Finding the parameters of a moment-matching GM requires the solution of a system of polynomial equations, like the one in the example.This problem is notoriously hard to solve, as no analytical solution exists [Lasserre 2009].Performing numerical optimization can solve the problem approximately, but is in general numerically unstable and requires relatively long computational times (in our example, using scipy [Virtanen et al. 2020], it took around 7 s to match a single Gaussian!).While we leave open the problem of solving 2) efficiently in the general case, the following lemma gives two important properties of match  , which will be used to derive our second-order approximation.The proof is quite trivial and reported in the Supplementary Material.
Lemma 4.6.The following two properties hold: i) when  = 2, match 2 () is a single Gaussian distribution with mean and covariance matrix equal to those of ; ii) if  is Gaussian, for any  ≥ 2 match  () = .
We conclude the section with a consequence of Lemma 4.6.It follows from ii) that    has the desirable property of leaving GMs unaltered, i.e. if  is a GM    () =  for all  ≥ 2. As a consequence, Gaussian Semantics coincides with the exact semantics for programs only involving GMs, and in particular, for programs involving only discrete distributions (mixtures of deltas).Proposition 4.7.Let  = ( , ) be such that every read-only variable in the program is a finite discrete distribution.Then, for any , ⟦⟧  = ⟦⟧.
Proof.Since truncations, linear combination and products of discrete distributions are discrete distributions, only discrete distributions are generated in the execution of ⟦⟧.By Lemma 4.6 applying the moment-matching operator    to them leaves them unaltered, so conclusion follows.□ Example 4.8.Consider a second Gaussian semantics that maps  2 to ( 2 ) = 2.In this case, we want to match only the first two order moments, namely As noticed before, in this case  * = 1, since we can take the Gaussian N (, Σ) with  = (0.7979, 0) and Σ = (1, 0) and it will have required moments.Observe that we do not need to solve any system or optimization problem.
In general, for a fixed number of moments matched, we expect Gaussian Semantics to approximate reasonably well the matched moments but not necessarily the whole distribution (see Section 7.2 for further discussion).Indeed, let us compare the exact posterior distribution with the ones obtained distribution when  = 2, 3, 4, respectively, using Kullback-Leibler (KL) divergence [Kullback and Leibler 1951].The KL divergence between two distributions  and  is a standard way to evaluate the error committed in approximating  with .In our case, we take as  the truncated Gaussian and as  the GMs obtained matching different order moments.The respective values are 2.29 (R=2), 1.71 (R=3) and 1.36 (R=4), so indeed higher-order Gaussian semantics improve the approximation.Figure 4 compares the true marginal pdf of  1 at node  2 (blue solid line) with the second-(orange dashed), third-(green dash-dotted) and fourth (red dotted) approximations.The advantages of fitting only a small number of moments are mainly computational.Indeed, it can be seen from the legend that as the number of moments matched grows, the increased KL accuracy comes with an increased computational cost.

UNIVERSAL APPROXIMATION THEOREM
Our main convergence result states that, for well-behaved programs, it is possible to find a map  so that the output distribution yielded by the semantics ⟦•⟧  is arbitrarily close to that of ⟦•⟧ in the Levy-Prokhorov metric [Ethier and Kurtz 2009].By "well-behaved" we mean that the distributions in the exact semantics are determined by their moments and that they can measure continuously  sets in the form (3). To formalize the latter requirement we introduce -continuity sets.Since both this definition and that of Levy-Prokhorov metric are borrowed from measure theory we refer the reader to the Supplementary Material for a more detailed background on these concepts.
Definition 5.1.Given a measure , a set  ⊆ R  is called an -continuity set if () = 0 where,  is the boundary of set , defined as the closure of the set  minus its interior.
Then, we can state our main theorem.
Theorem 5.2.Assume that  = ( , ) is a program such that for each  ∈  and each path  ∈ Π  the output distribution  of ⟦⟧  satisfies the following: H1)  is determined by its moments; H2) if  is the input distribution for a test or observe node  ′ , then the set defined by the LBC labelling  ′ is an    -continuity set.Then there exists a sequence of maps (  :  → N 0 )  ∈N such that: where the convergence is intended in the weak topology, or equivalently, in the Levy-Prokhorov metric.

Satisfaction of the Hypotheses
Before giving an outline of the proof, we briefly comment on the hypotheses.
First, observe that H1 and H2 are sufficient but not necessary.In particular, if the hypotheses of Proposition 4.7 are satisfied convergence holds trivially, even when H1 or H2 are violated.
Hypothesis H1 is common to other works considering moment-based approximation, such as in Bartocci et al. [2020] and Moosbrugger et al. [2022] and is needed to guarantee that the method of moments converges to the true distribution [Billingsley 2013].For a given program, it is possible to perform static analysis to check whether the arising distributions are determined by their moments, exploiting known results on moment determinacy (see, for example, the moment-generating function characterization in [Billingsley 2013]).Notably, to apply these results is not necessary to compute the exact pdf of the arising distributions, but it is sufficient to keep track of their type.In fact, for some classes of distributions, moment-determinacy is established: this is true for finite discrete distributions, Gaussians, uniforms, Poissons, exponentials, truncations and mixtures thereof [Billingsley 2013].The case studies analyzed in this paper feature such distributions.On the contrary, log-normal distributions are not determined by their moments.However, as long as moments are computable, Gaussian Semantics can still be applied: in this case no formal guarantee of convergence towards the true distribution is given, but the method still provides an analytical approximation for the moments.
Hypothesis H2 guarantees that when distributions are conditioned to sets in the form (3), weak convergence is preserved.This requirement can be falsified if  has degenerate components that place positive probability mass on the boundary of the set defined by the LBC.This could happen, for instance, if a component is a Dirac measure centered on any point of ⟦B(x)⟧.For example, consider line 12 of Tracking_ in Section 1.1, where  can be 1 or 0 with probability > 0. This falsifies H2.However, such cases can be statically detected and the program can be transformed into one that uses the equivalent condition as  > 0.5, so that H2 holds.More in general, continuity corrections such as those performed in Laurel and Misailovic [2020] can be adopted.
For Tracking_ in Section 1.1, each marginal is obtained by Gaussians, performing sums, squares, or conditioning.Here, to check moment-determinacy, we use a result in Billingsley [2008], which states that if the moment-generating function (mgf) of a distribution is defined in an interval of 0, then the distribution is determined by its moments.Using symbolic integration, we can compute the mgf of the product of two Gaussians and verify that it is defined in an interval of 0. Therefore, the distribution is determined by its moments and H1 holds.For H2, we have already shown how to correct the condition in line 12 so that H2 is verified.For the if statement in line 7, observe that before entering it, the marginal w.r.t. to  is the distribution of  2 +  2 , which is continuous, and therefore the point 10 has measure 0 with respect to it.Therefore also H2 holds.

Outline of the Proof
The proof of Theorem 5.2 is rather technical and involves a number of results from measure theory.Thus, here we provide a sketch, and refer the reader to Supplementary Material for a detailed proof.
First of all, recall that the semantics of a program is defined as a (finite) mixture of the semantics of the paths in the program.Therefore, (9) holds if it holds for every path  ∈ Π  .Moreover, it can be shown that we can consider only the paths such that ⟦⟧ = (, ) with  ≠ 0, since paths for which  = 0 contribute to the semantics of the program nor in the exact, neither in the Gaussian Semantics.So, the proof amounts to showing that for every path  ∈ Π  such that ⟦⟧ = (, ) with  ≠ 0 we can choose a sequence of maps   such that ⟦⟧   → ⟦⟧.
We can build the sequence of maps   by specifying   () for each  ∈ .In particular, for  0 we can choose any value of   .For  1 , if it is not the exit node, we can assume that ⟦ 1 ⟧ transforms the pair (1,  0 ) into ( 1 ,  1 ) with  1 ≠ 0.Then, by definition of Gaussian Semantics, ⟦ 1 ⟧   transforms (1,  0 ) into the pair ( 1 ,    ( 1 ) ( 1 )).To ensure convergence in this case, we use Theorem 5.4 of Billingsley [2008], which we state below.Theorem 5.4 (Billingsley [2008]).Suppose  ∼ ,   ∼   , and  is a distribution determined by its moments, while   has have moments of all orders.If for all  > 0 lim then   →  in the Levy-Prokhorov metric.
Then, if   ( 1 ) is an increasing sequence in  (for example   ( 1 ) = ) we can use Theorem 30.2 of Billingsley [2013] and H1 to say that For  2 and for any node   after that, we assume that, in the exact semantics, the node takes as input a pair (, ) such that  ≠ 0, while in the Gaussian Semantics associated with   , it takes as input (  ,   ) such that   →  in R and   →  in the Levy-Prokhorov metric.Then, we need to prove that we can choose   (  ) so that ⟦  ⟧   (  ,   ) → ⟦  ⟧(, ).
We do this in three steps.First, we prove that the exact semantics preserves the convergence, i.e.
In particular, when   : , an assignment is of the type   =  () with  () in the form (1) (here we do not consider read-only variables for simplicity, but they can easily taken into account).Since  () is continuous, convergence follows from Theorem 5.5 by taking For   :  and   :  we use again the Mapping Theorem, in two different ways.First, we use it with ℎ = I ⟦ ( )⟧ , where () is the LBC labelling the node, to show that  ′  →  ′ .Then, we fix a point  * such that    ( * ) = 0 for all  and we choose ℎ to be: In both cases, H2 is fundamental in guaranteeing that the Mapping Theorem still holds.

SECOND ORDER GAUSSIAN APPROXIMATION
As discussed in Examples 4.5 and 4.8, while implementing an arbitrary order Gaussian Semantics may be difficult, it is straightforward to compute the Gaussian Semantics associated with  = 2 (i.e., such that at each node of the control-flow graph matches the first two order moments (mean and covariance matrix).We propose Second Order Gaussian Approximation (SOGA), an algorithm that implements this particular case.

Overview
In our implementation, SOGA accepts programs in a Python-like syntax, then compiled into a formal control-flow graph.SOGA recursively visits the nodes of the control-flow graph in a breadthfirst fashion to compute the semantics of all paths.Furthermore, each node has two attributes,  and :  is a non-negative scalar proportional to the probability of reaching that node, while  stores the output distribution (in the form of a GM) computed by the semantics of that node.
The procedure applied for each node is summarized in Algorithm 2, where we assume that each node stores its parents and children in suitable attributes.When entering a new node, SOGA retrieves the pairs (, ) computed by the parents of the current node and merges them in a single pair (_, _) using the function merge_dist (line 1-5).Then, it computes how the node semantics transforms the latter pair and stores the result in the attributes .,.(line 6).Finally, it calls itself recursively on the children node (lines 7-9).When the  node is reached the algorithm ends, leaving the posterior distribution stored in its attribute ..
The core of the algorithm is the function node_semantics, that, for each node type, transforms the pair _, _ into a new pair , .When  : ,  node_semantics leaves the pair unaltered; when  :  or  : ,  the functions apply_rule(_,  ) and approx_trunc(_, ) are invoked, respectively.We detail the functions below.
Function apply_rule.It implements the semantics of a state node.In particular, it takes as input the current mixture _ and an expression  of type (1).It returns a new distribution  obtained applying  and   2 to _.To compute the moments of the transformed distribution , and therefore its second-order approximation, it uses the results in Table 1: when  is a linear transformation, it applies the formulas for the sum of multivariate Gaussians [Billingsley 2013].When  involves products, it applies Isserlis' theorem [Wick 1950].
Function approx_trunc.It implements the semantics of a test or an observe node.It takes as input the current mixture _ and a set  defined by an LBC of type (2).It returns the probability mass , given by the probability that _ satisfies , and a new mixture distribution , representing the GM approximating _ conditioned to .Again, it applies the results in Table 1 to compute : in particular, when the LBC expresses inequality constraints the formulas in Kan and Robotti [2017] are used; when instead the LBC has the form x i == c it uses the formulas from Bishop and Nasrabadi [2006].
Function merge_dist.Merging is performed whenever a node is accessed prior to applying its semantics: merge_dist collects all the output pairs (, ) computed at the parent nodes, and merges them together in a single GM.Given the set of  parents' pairs ( 1 ,  1 ), . . ., (  ,   ), the function returns probability mass _ =  1 +. ..+  and a new GM _ = 1  _ ( 1  1 +. ..+    ).For an exit node, the output of this function is the output distribution of the program.

Distributivity of Transfer Functions
SOGA explores the control-flow graph in a breadth-first fashion, performing merges when required.On the other end, the exact and the Gaussian semantics are defined as a sum over all execution paths, leading to an apparent discrepancy.To ensure that SOGA indeed computes the Gaussian Semantics associated with the map () = 2 we show in Proposition 6.1 that the transfer function of the exact semantics is distributive with respect to the merge operation.
To do this, for a set of pairs (  ,   ) we define the merge operator We show that computing the semantics of a node after performing a merge gives the same output distribution as computing the semantics of each pair and then merging the results.This distributivity transfers straightforwardly to Gaussian Semantics, since the latter is computed by composing the exact semantics with the operator    , which is distributive with respect to merging by Definition 6.This, in turn, justifies computing the semantics exploring the control-flow graph in a breadth-first fashion as SOGA does.Proposition 6.1.Let (  ,   ) be pairs with   ≥ 0 and   a distribution for  = 1, . . ., .Let  be a node of type state, test, observe or exit.Then  (⟦⟧( 1 ,  1 ), . . ., ⟦⟧(  ,   )) = ⟦⟧( (( 1 ,  1 ), . . ., (  ,   ))) Let us show for each type of node that P = P and D = D. We observe that for  : , with  labeled by skip, and for  :  conclusion follows trivially.We examine the remaining cases separately.
• Let  :  and suppose  is labelled by

SOGAprune
To improve the scalability of SOGA we propose a second version of the algorithm, called SOGAprune, in which the user can introduce at script level the instruction prune(),  being an integer number.
When the script is compiled in a cfg, the prune instruction is compiled in a new node of type .
Function prune_dist.It prunes the current distribution _ to keep the number of its components below a user-specified bound .The pruning is performed similarly to Chaudhuri and Solar-Lezama [2010].In particular, for each pair of components ,  in the input distribution input_dist, having mixing coefficients   ,   , means   ,   and covariance matrices Σ  , Σ  , we compute the mean After computing the cost for all pairs (, ) such that  ≠  and ,  < , the pair (, ) with minimal cost is substituted with a single component having mean  ′ and covariance matrix Σ ′ with Observe that  ′ and Σ ′ are exactly the mean ad the covariance matrix of the mixture

Computational Cost
We first compute the computational cost without pruning, then we discuss how pruning affects it.
Cost without pruning.Let | | denote the total number of nodes, | | the number of test nodes, | | the number of test and observe nodes and | | the number of state nodes.W.l.o.g.we assume for simplicity that all read-only variables are pushed to an initial distribution  0 over R  ; thus the output of the entry node is (1,  0 ) and all assignments only use output variables.By doing this we compute an upper bound on the true computational cost since the dimensions corresponding to read-only variables are dropped after marginalization.Letting  0 denote the number of components of  0 , the output distribution will have at most  ≤   = 2 | |  0 components.
We consider the cost to access a node and perform elementary operations, such as assignments and products, constant.Expressions  and  are assumed to be stored in suitable data structures accessible in constant time, so storage and reading of them are also considered elementary operations.Overall, elementary operations contribute to the total computational cost with a term  (| |), which is however dominated by the computational cost of executing approx_trunc, apply_rule and merge_dist.We examine their cost separately.
The function approx_trunc is invoked once when an observe node is accessed and twice when a test node is accessed, for the true and the false branch respectively.When B(z) is in the form c 1 • z 1 + . . .+ c n • z n ⊲⊳ c a singular value decomposition is performed to change coordinates, so that in the new set of coordinates the truncation set is a hyper-rectangle (cost  ( 3 ), [Gu and Eisenstat 1995]).Then, a new mixing coefficient has to be computed for each component to convert the truncated GM into a mixture of truncated Gaussians (cost  ()).Finally, for each truncated Gaussian, the first two order moments are computed using the formulas in Kan and Robotti [2017] (cost  ( 4 ), for a detailed account see Supplementary Material).When B(z) is in the form x i == c, to apply the formulas in Bishop and Nasrabadi [2006], matrix multiplication must be performed, amounting to cost  ( 3 ) [Skiena 2008].Overall, we have a cost of  | |   4 .
The function apply_rule is invoked every time a state node is accessed.Since affine transformations require matrix multiplication (cost  ( 3 )), the total cost is  | |   3 .
Finally, the function merge_dist is invoked whenever a node is accessed and performs a scalar product.It contributes for a cost  (| |  ) .
The total cost of SOGA is therefore The function prune_dist is invoked at most | | times.When invoked, it first computes the cost for all possible pairs of components, which is at most   (  − 1).The computation of the cost function for each pair has cost  (), while the computation of the covariance matrix (cost  ( 2 )) is performed for a single pair.At its first iteration, the computational cost of prune_dist is, therefore,  ( 2  ).After this, new costs are computed for at most   −  times, but each time only for  <   pairs of components.The whole cost of the function is therefore  ( 2  ).
Substituting in (11) one gets that the computational cost with pruning is bounded by: Comparing ( 11) with ( 12) one can conclude that pruning is only effective in reducing the computational cost when the overhead introduced by pruning ( 2 2 2| |  ) is less demanding than dealing with the full space of paths ( 0 2 | | ).To keep the overhead contained one could use small values of  while keeping also | |  small (e.g. by introducing many pruning instructions).However, this introduces an additional level of approximation which can hinder the accuracy of SOGA.

NUMERICAL EVALUATION
We split the numerical evaluation into four parts.In Section 7.1 we compare SOGA with four baseline tools representative of different inference methods for estimating the posterior mean: STAN for MCMC [Carpenter et al. 2017], PSI for exact symbolic analysis [Gehr et al. 2016], AQUA for quantization of posterior distributions [Huang et al. 2021] and Pyro for VI [Bingham et al. 2019].In Section 7.2 we compare SOGA against Pyro in performing Maximum A Posteriori (MAP) estimation [Gelman et al. 2013], to test how well our method is able to capture the posterior distribution, in addition to its moments.Finally, in Sections 7.3 and 7.4, we evaluate SOGA's performance on two applications that have been extensively studied in the literature, owing to their significant practical impact.The first application is inference on models involving mixtures of continuous and discrete distributions, as in Gao et al. [2017]; Kharchenko et al. [2014]; Pierson and Yau [2015]; the second application is Bayesian inference on collaborative filtering.[Zhao et al. 2013].
The considered programs are listed in Table 3. Pruning was applied after every test and observe nodes (repeating it only once if they occur subsequently) for programs whose computation time was greater than 1 s and at least ten times larger than the worst performing tool.We set  = 0.015  except for NormalMixtures; there, since   exceeded the tens of thousands, we set  = 30.With this strategy, the pruning algorithm was invoked only in 4 out of the 18 considered programs.
The experiments were performed on a laptop equipped with a 2.8 GHz Intel i7 quad-core processor and 16 GB RAM, CmdStan v2.30.1 and Wolfram Mathematica 13.1 (Wolfram Research, Inc.), setting a time-out threshold at 600 s.
7.1.1Results.Table 3 collects the results where time refers to the average runtimes (in seconds) out of 10 executions and value refers to the computed expected value of a target variable in the model.For each model we specify the kind of distributions involved: B=Bernoulli, Be=Beta, D=Discrete, G( * ) =Gaussian (with non-constant mean), U=Uniform.For STAN, we indicate the  time needed to obtain a 5% confidence interval whose amplitude is contained in 1% of the mean (up to a maximum of 10 5 samples).For PSI we report the sum of the time needed to generate the symbolic formula and that needed to integrate it when in the presence of non-simplified integrals (observe that in the original paper, only the time for symbolic computation was considered).For VI, due to high sensitivity with respect to the hyperparameters [Hoffman et al. 2013], we proceed using three different learning rates (0.01, 0.005, 0.001), and we report the most accurate estimation (detailed results can be found in the Supplementary Material).The number of iterations of the stochastic gradient descent is increased from a minimum of 100 to a maximum of 10k, stopping the optimization if the difference between the estimated mean posterior and the mean posterior estimated 100 steps before is less than 1% of the current estimation.For SOGA, runtimes labeled with * indicate that the pruning algorithm was invoked.Finally, we highlight the fastest method with a grey background.For accuracy evaluation, we consider PSI's results as ground truth when available (i.e., when PSI terminates and the integration is successfully computed within the timeout threshold).We made this choice since PSI is an exact method and the only guaranteed to be exact among the evaluated tools.
Only in one example, BayesPointMachine, SOGA performs poorly in terms of accuracy, estimating a value of 0.011 for a parameter estimated by STAN as 0.054 ± 2e−4.We remark, however, that this program turned out to be particularly difficult to solve for AQUA (which issued an out-of-memory error) and PSI (which was not able to complete the symbolic computation of the posterior).On the other examples, SOGA yields very good accuracy, with a relative error below 7% across all comparable models.We now discuss a detailed comparison of runtimes against each tool method.
STAN.STAN does not support discrete posteriors; hence it could not analyze eight models.For the models that can be analyzed by both, SOGA outperforms STAN in terms of runtimes on Altermu, Altermu2, and RadarQuery.By contrast, STAN outperforms SOGA in Bernoulli and NormalMixtures.We attribute this to the presence of non-Gaussian priors and a large number of observations, resulting in a high number of components and truncations to be computed.Both have similar performance on the remaining models.
AQUA.SOGA is more flexible than AQUA in that it supports discrete posteriors.On ClickGraph, and TrueSkills AQUA issued an out-of-memory error while SOGA could approximate the posterior mean.We ascribe this issue to the fact that AQUA uses tensors, whose dimension rapidly increases with the number of distributions.In particular, in AQUA each distribution must be stored in the tensor, while SOGA can use fresh read-only variables which are dropped once the marginal over the output variables is evaluated.Notably, SOGA outperforms AQUA also on Altermu, Altermu2 and TimeSeries proposed in the AQUA paper [Huang et al. 2021].Instead, AQUA is more efficient than SOGA in RadarQuery, Bernoulli and NormalMixtures, for the same reasons explained for STAN.
Pyro.Being a gradient-based method, Pyro's VI offers limited support for discrete variables, 1 so that, similarly to STAN and AQUA, we were not able to encode models with discrete posterior.In addition, we found that the encoding of RadarQuery incurred runtime errors.For the remaining models, VI is comparable to SOGA, when not less accurate, and taking longer runtimes.Noticeable exceptions are BayesPointMachine, where, as already noticed, SOGA is not able to achieve a good accuracy and ClickGraph, where SOGA incurs long runtimes even with pruning.On the other hand, VI exhibits a significant sensitivity with respect to the choice of the hyperparameters, which can result in non-convergence and sloppy approximations for poor choices of the parameters (results for all the tested learning rates can be found in the Supplementary Material).(SurveyUnbias, Trueskills, Radar) or by the high number of observations (Altermu, Altermu2, TimeSeries).In Altermu, PSI could not compute a symbolic formula within the time-out threshold, while in BayesPointMachine, TrueSkills, and TimeSeries the formula contained non-simplified integrals, whose integration in Mathematica took longer than the time-out threshold.Notably, for models involving only Bernoulli distributions (Burglar, Grass, MurderMistery, NoisyOr, TwoCoins), for which both tools are exact, their performance is comparable.

Performance of Pruning.
In the right inset we report the runtimes, values, and number of components  for SOGA without pruning applied to the four models that required the application of pruning (Bernoulli, ClickGraph, ClinicalTrial, NormalMixtures).All models share the occurrence of GM distributions with more than 1000 components.For Bernoulli, pruning allowed comparable runtimes with respect to the best-performing tool, while base SOGA was about 9 times slower (11.97 s).In addition, base SOGA computed an output indistiguishable from the pruned version up to the third decimal digit.For the other three cases, base SOGA was unable to compute a numerical result within the time-out threshold.For these, the number of components  is the one reached before timing out.Applying pruning allowed SOGA to complete the computation within the time-out threshold while achieving excellent accuracy with respect to the ground truth.

Maximum a Posteriori Estimation
Since SOGA approximates the posterior with a Gaussian mixture, it can also compute the Maximum a Posteriori (MAP) estimate by simply returning the mean of the GM component with the largest mixing coefficient.Here we compare its performance against Pyro, in which MAP estimation can be performed using a different parametrizing distribution than the one used for the mean posterior inference. 2To get a baseline for the MAP value, we first generate the symbolic posterior using PSI and then optimize it numerically.For models in which PSI is not able to compute the exact posterior, we estimate the ground truth by taking 10k samples from the posterior and binning them into 50 intervals; then, MAP is the midpoint of the interval with the most samples.We tested the same models with continuous posterior reported in Table 3, except Altermu2, since, by visual inspection, we found that it has a flat posterior.
Results are reported in Table 4. Due to Pyro's sensitivity to hyperparameters observed in the previous section, we tested three different values of learning rate.Table 4 only reports the closest estimation to the baseline; full results are available in the Supplementary Material, confirming the sensitivity issues.These experiments show that SOGA performs relatively worse than in the estimation of the posterior mean.This is expected because SOGA is designed to match means and variances, but it does not necessarily approximate the whole distribution.However, compared to Pyro, it is still able to obtain the closest estimation for BayesPointMachine, ClickGraph, TrueSkills, Altermu and NormalMixtures, while it is outperformed by Pyro in Bernoulli, CoinBias, SurveyUnbias and TimeSeries.Finally, we note that analyzing Bernoulli with SOGAprune degrades the MAP estimation, unlike in the posterior mean.

Mixtures of Continuous and Discrete Distributions
Mixtures of continuous distributions and discrete probability masses appear in different domains such as in Gao et al. [2017]; Kharchenko et al. [2014]; Pierson and Yau [2015].Languages such as STAN and AQUA do not support them.Ad hoc methods have been proposed in Tolpin et al. [2016] and Nitti et al. [2016].More recently Wu et al. [2018] extended the sampling techniques used in BLOG for more accurate inference.We test SOGA on the three benchmarks proposed by Wu et al. [2018] and compare its runtimes against PSI, BLOG, and variable elimination (VE) as implemented in Pyro [Obermeyer et al. 2019 are reported exactly as in the original paper, while the Tracking_ example from Section 1.1 is adapted since it was originally cast as a control problem.All examples have a Dirac delta posterior, which is computed exactly by all.However, SOGA is the fastest and the one which scales better as the number of steps  increases.

Bayesian Inference for Collaborative Filtering
Collaborative filtering models are well-known in machine learning for applications to recommendation systems [Koren et al. 2021].We target the problem of Bayesian inference on the latent factor model proposed in Hofmann and Puzicha [1999], which arises after a singular value decomposition and serves as the basis for solving an optimization problem [Zhao et al. 2013    will result in the same distribution for    , which is the only one observed.In some cases, one may still want to model each parameter separately to allow for more flexibility.In this particular case, though not solving the problem of symmetry and non-identifiability, SOGA can estimate the distribution of    faster than its competitors.Results are shown in Table 5 for various values of .PSI results are not reported because the tool was able to produce a symbolic formula only up to  = 3; however, even in these cases, numerical integration of the non-simplified integrals required more than 600 s.Although STAN's estimates are accurate and close to SOGA's ones, its runtimes are longer due to the increased cost of sampling, which is exponential in the number of variables.
As above, we attribute AQUA's out-of-memory error to its tensor based representation.For VI, we report results for the learning rate 0.005, which we found to be the one performing best in average, among the tested ones.A full set of experiment results can be found in the Supplementary Material.VI exhibits an accuracy comparable to SOGA's, but significantly longer runtimes.We observe, however, that thanks to vectorization, VI's runtimes do not significantly increase with .
Overall, the excellent runtime performance of SOGA is due the particular structure of the models, which exhibit Gaussian posteriors on variables combined in a scalar product without introducing truncations that could slow down the computations.

FURTHER RELATED WORK
Inference.In addition to the techniques discussed earlier in this paper, volume computation can be quite efficient for discrete models [Filieri et al. 2013;Holtzen et al. 2020]; however, it cannot be applied to continuous distributions.All the mentioned methods use a pdf representation of the distributions.More recently, representations using generating functions have been investigated, but only for discrete distributions [Chen et al. 2022].Finally, some approaches use moment-based invariants [Barthe et al. 2016;Bartocci et al. 2020;Chakarov and Sankaranarayanan 2014;Katoen et al. 2010;Moosbrugger et al. 2022].While they share the idea of computing moments up to a certain order, they differ both with respect to the supported programs and the computed information, making a direct comparison difficult.
Universal Approximators.Our approach can be ascribed to the practice, common in many branches of mathematics, of studying universal approximators, whereby one shows that a given function belonging to a certain class is shown to be approximated, arbitrarily closely, by another family of (parameterized) functions.Notable examples are polynomials [Pérez and Quintana 2006], and neural networks [Hornik et al. 1989;Zhou 2020].
Gaussian Approximators.The approximation-by-Gaussian approach is also common to Laplace approximation [Tierney and Kadane 1986].Laplace approximation is a mode matching strategy and is more expensive computationally than VI, as it is based on an optimization process to find the mode.Generally, however, it is inferior to VI [Bishop and Nasrabadi 2006].Another kind of Gaussian approximation is Gaussian Smoothing [Chaudhuri andSolar-Lezama 2010, 2011], although it does target neither probabilistic programs nor the inference problem.

CONCLUSIONS
Gaussian Semantics is a family of approximations parameterized by the moment order to match against a Gaussian mixture at each location of a probabilistic program.The universal approximation theorem states that such a family converges to the true semantics.Although, in principle, any program location could be treated with different moment-order matching, in practice this is a difficult problem that requires the solution of a system of nonlinear equations.While the system is guaranteed to have a solution, finding it using SMT solvers over reals or numerical methods yields poor results, due to long computational times and numerical instability.Therefore we leave open the general problem of implementing Gaussian Semantics for any order of moments.However, we provide an analytical method that matches second-order moments of the exact probabilistic semantics (SOGA).The numerical results for the case studies demonstrate high quality of the approximation and that SOGA complements state-of-the-art methods for probabilistic inference and in particular for inference on models with mixtures of discrete and continuous distributions and for Bayesian inference on collaborative filtering models.Due to the efficiency shown by SOGA, we believe that in these cases our method can effectively be used as an alternative to sampling.
As regards future work, while SOGA performed satisfactorily on all tested benchmarks, it could not be applied to some of the models from the same repositories, due to the limitations of our syntax.Extending the latter to include general distributions depending on non-constant parameters, unbounded loops and non-polynomial functions would widen its scope of applicability.A possible way to overcome the former restriction could be learning offline the approximating distributions as a function of the variable parameters, but how to do this efficiently is currently not clear, even though of great interest.For what concerns unbounded loops, we observed that for almost surely terminating programs, the loops can be unrolled for a finite number of iterations so that the error committed in the approximation is arbitrarily small.This suggests that increasing the number of unrolled iterations together with the number of moments matched should preserve our convergence theorem, even in the case of almost surely terminating unbounded programs.Similarly, one could exploit convergence results for polynomial approximations to extend the convergence result to sequences of polynomial programs that approximate programs featuring non-polynomial functions.We leave the possibility to explore these extensions of our convergence result in future work.
Finally, one might devise algorithms for higher-order moments.While an extension to exact higher-order moment matching seems hard, a relaxed moment problem could be defined as an optimization problem [Hansen 2010].

ACKNOWLEDGMENT
This work was partially supported by the projects SERICS (PE00000014) and by Investment 1.5 Ecosystems of Innovation, Project Tuscany Health Ecosystem (THE, B83C22003920001) and Interconnected North-East Innovation Ecosystem (iNEST, ECS_00000043) under the MUR National Recovery and Resilience Plan funded by the European Union -NextGenerationEU.We would like to thank Joost-Pieter Katoen for his feedback on a preliminary version of this paper and the anonymous reviewers for their valuable comments.Proposition C.2.For any  and  the Criterion of Choice uniquely identifies match  ().
Proof.By Proposition C.1   () is non-empty, moreover by Theorem 17.2 in [Schmüdgen 2017] there is at least one moment-matching mixture such that  < |N ( )|, therefore  * is welldefined.Once  * is fixed, the set of parameters P satisfying the moments conditions is the set of solutions of a system of polynomial equations equating the moments of the mixture of  * components, expressed as functions of   ,   and Σ  , to the moments of  [Wang et al. 2015].Being the set of solutions of a polynomial system, P is closed.Moreover, since by Example 12.2.8 in [Cover 1999] for fixed moments the entropy is bounded from above, − is bounded from below, and we can always choose  > 0 so that P * is contained in P ∩ [0, 1]  * × [0, ]  * + 1 2  (+1) * .It follows that P * is compact.Finally, the maximum with respect to the lexicographic ordering can be seen as maximising projections of the vector of parameters  on different coordinates, in a given order.Since P * is compact, the set of maximals with respect to the lexicographic ordering is non-empty, but since the lexicographic ordering is a total order the set of maximals can have only one element which is uniquely defined.□ Lemma C.3.The following two properties hold: i) when  = 2, match 2 () is a single Gaussian variable with mean and covariance matrix equal to those of ; ii) if  is Gaussian, for any  ≥ 2 match  () = .
Proof.The first point follows observing that, since we want to match the first two order moments, a single Gaussian variable can be used, so  * = 1.Moreover, since we have a single component with mean and covariance matrix fixed, the set P has a single set of parameters and our criterion of choice reduces to approximating  with a Gaussian having the same mean and covariance matrix.
On the other hand, if  is Gaussian then for any  ≥ 2 we always have  * = 1 and the set P has a single set of parameters, so match  () = .□ D PROOF OF THE UNIVERSAL APPROXIMATION THEOREM Theorem D.1.Assume that  = ( , ) is a program such that for each  ∈  and each path  ∈ Π  the output distribution  of ⟦⟧  satisfies the following: H1)  is determined by its moments; H2) if  is the input distribution for a test or observe node  ′ , then the set defined by the LBC labelling  ′ is an    -continuity set.Then there exists a sequence of maps (  :  → N 0 )  ∈N such that: where the convergence is intended in the weak topology, or equivalently, in the Levy-Prokhorov metric.

D.1 Preliminary Results
The proof of the main theorem relies on two auxiliary results: first, we show that the exact semantics preserves weak convergence (Lemma D.2); second we prove that, given a weakly converging sequence of distributions   , it is always possible to choose a sequence of integers   such that     (  ) converges to the same limit (Lemma D.3).

F SOGA IMPLEMENTATION
We assume that each node in the control-flow graph has two attribute lists of children and parents, whose elements point, respectively, to children and parent nodes.Furthermore, each node has two attributes,  and :  is a non-negative scalar proportional to the probability of reaching that node, while  stores the output distribution (in the form of a GM) computed by that semantics of the node.Nodes have type-specific attributes: nodes of type test and observe have an attribute LBC storing an LBC expression; nodes of type state have an attribute cond taking value true, false or none and an attribute expr storing an assignment expression.
To apply SOGA we create a queue _ containing the entry node.Then we apply iteratively SOGA on pop(_).When called on a new node, the algorithm first accesses the attributes  and  of its parents, and invokes merge_dist on the list of pairs (, ).Then, computes the semantics corresponding to the node type as follows: • if  : , it initializes .to 1, .to  0 and .to ; • if  : , it saves the LBC in .,and calls the function approx_trunc; • if  : , it does nothing; • if  : , it checks if .=  or .=   and in that case retrieves the LBC condition from the parent  node.Then it calls the function approx_trunc.This results in a new pair (, ) on which the function apply_rule is applied.Finally, the output is stored in .,.;• if  : , after merging the resulting distribution is returned as the approximated output distribution of the whole program.
After executing the semantics of the node the queue is updated, by pushing the children nodes of the current node.This is detailed in Algorithms 3-7.

Fig. 1 .
Fig. 1.Left part: general approximation scheme used in Gaussian Semantics.In program location   the exact semantics of the program transforms a GM  into a non-GM distribution .In the same program location, Gaussian Semantics transform  into another GM   , that approximates  using moment-matching.Right part: concrete example.The Gaussian distribution  is transformed by the exact semantics into a truncated Gaussian  and by the second-order Gaussian Semantics into the red Gaussian distribution   .
Fig. 4. Marginal pdf of  1 at node  2 in Algorithm 1 given by exact semantics () and the Gaussian Semantics with  = 2, 3 and 4. In the legend we report the KL divergence with the true distribution and the time need to compute the approximating GM.
1 https://pyro.ai/examples/enumeration.html Proc.ACM Program.Lang., Vol. 8, No. POPL, Article 63.Publication date: January 2024.Inference of Probabilistic Programs with Moment-Matching Gaussian Mixtures 63:25 • A state node  takes as input a pair (, ) and returns a pair (,  ′ ) depending on its label.If it is labelled by skip, it returns (, ).If it is labelled by x i = E(z), it returns (,  ′ ) with  ′ the distribution of the vector  [  =  ()]:if  :  then ⟦⟧  (, ) =  ′ ) if  is labelled by x i = E(z).•A test node  labelled by B(z) takes as input a pair (, ) returns ( ′ ,  ′ ) depending on the value of  (  ()).In particular, first, the augmented vector  and its distribution   are considered.If  (  ()) = true, then the node computes the probability of the Boolean condition evaluating to true, i.e.    (⟦B(z)⟧).Then, it conditions the current distribution to such event, i.e.   | ⟦B(z)⟧.The result is the output pair (• ) for each node type.Then, we use again Theorem 5.4 and H1 to say that    ( ′  ) →  ′ as  → ∞.Finally, since ⟦⟧   (  ,   ) is obtained from ( ′  ,  ′  ) applying the operator (I,    (  ) ), we can use a diagonal argument to ensure that we can fix   (  ) for each  so that ⟦

Table 2 .
Function implementing node semantics in SOGA and SOGAprune.The input arguments _, _ are retrieved by the parent nodes' attributes , .The input arguments  ,  and  are stored in node attributes when the cfg is compiled from the program script.
(  | ⟦B(x)⟧) = D. Observe that we have assumed that for at least one  p ≠ 0. However, if that is not the case   (⟦B(x)⟧) = 0 and the conclusion still holds.•Let  : .If B(x) has a probability greater than zero conclusion follows as in the previous case.If B(x) has the form x k == c we can use the same argument, but we need to replace    (⟦()⟧) with the normalization constant +  N (  , Σ  ) +     +  N (  , Σ  ).This produces the best possible approximation of the two components[Chaudhuri and Solar-Lezama 2010].The procedure is iterated until the number of components is less than  (observe that after the first two components have been merged into a new one, we need to recompute the cost only for the pairs in which the new component appears).A summary of how the semantics of each node is implemented is reported in Table2, while detailed algorithms for SOGA implementation can be found in the Supplementary Material.

Table 5 .
Runtimes (in seconds) and computed values for the collaborative filtering model N (   , 1).

Table 8 .
Comparison between Pyro and SOGA for Variational Inference on Collaborative Filtering models.