Compiling Probabilistic Programs for Variable Elimination with Information Flow

A key promise of probabilistic programming is the ability to specify rich models using an expressive programming language. However, the expressive power that makes probabilistic programming languages enticing also poses challenges to inference, so much so that specialized approaches to inference ban language features such as recursion. We present an approach to variable elimination and marginal inference for probabilistic programs featuring bounded recursion, discrete distributions, and sometimes continuous distributions. A compiler eliminates probabilistic side effects, using a novel information-flow type system to factorize probabilistic computations and hoist independent subcomputations out of sums or integrals. For a broad class of recursive programs with dynamically recurring substructure, the compiler effectively decomposes a global marginal-inference problem, which may otherwise be intractable, into tractable subproblems. We prove the compilation correct by showing that it preserves denotational semantics. Experiments show that the compiled programs subsume widely used PTIME algorithms for recursive models and that compilation time scales well with the size of the inference problems. As a separate contribution, we develop a denotational, logical-relations model of information-flow types in the novel measure-theoretic setting of probabilistic programming; we use it to prove noninterference and consequently the correctness of variable elimination.


INTRODUCTION
A probabilistic model describes a joint distribution p(z, x) over latent variables z and observations x. Bayesian inference is concerned with computing p(z | x) = p(z, x)/p(x), the posterior distribution of z conditioned on x. Typically, the hard work is in computing the marginal likelihood p(x), also known as the model evidence. Computing the marginal may be intractable, as it generally requires integration over all possible values of the latent variables: p(x) = ∫ p(z, x) dz. Probabilistic programming languages (PPLs) are powerful means to specify probabilistic models and solve inference problems. A PPL allows for harnessing the expressivity of a high-level programming language to specify rich Bayesian models, as opposed to using more limiting formalisms such as Bayesian networks. However, the expressive power that makes PPLs enticing makes inference even harder. Recent advances in PPL inference often specialize to a particular class of models and impose restrictions on expressible models, prohibiting useful features such as recursion.
Variable elimination (VE) is an effective approach to inference for probabilistic models with discrete random variables (r.v.s) [68]. It works by marginalizing out (i.e., eliminating) discrete r.v.s from a joint distribution, thus producing the marginal likelihood or reducing the inference problem to ones involving only continuous r.v.s.
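To make the mechanism concrete, the following is a minimal Python sketch (illustrative only, not Mappl syntax) of eliminating a single discrete r.v. z from a two-variable model p(z, x) = p(z) · p(x | z); the probability tables are assumed for the example.

```python
# Eliminate the Boolean latent z by summing, over its finite support,
# the product of all factors that depend on z.
def marginal_likelihood(p_z, p_x_given_z, x):
    return sum(p_z[z] * p_x_given_z[z][x] for z in (0, 1))

p_z = {0: 0.3, 1: 0.7}                      # assumed prior over the latent z
p_x_given_z = {0: {"a": 0.9, "b": 0.1},     # assumed emission table
               1: {"a": 0.2, "b": 0.8}}

print(marginal_likelihood(p_z, p_x_given_z, "a"))  # 0.3*0.9 + 0.7*0.2 = 0.41
```

Once z is summed out, only the observation x remains, so the result is the marginal likelihood itself.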
VE has been generalized to PPLs, but existing support for VE does not meet the desired level of generality and scalability. For example, Factorie [42] is an interpreted PPL that supports VE for factor graphs, but its VE algorithms make specific, rigid assumptions about the factor-graph structure. SlicStan [29] and PERPL [15] are more recent, compiled approaches to VE and are designed to support broader classes of models. Unfortunately, SlicStan lacks support for recursion, even though recursion is a natural means to specify models in domains such as language modeling and computational biology. Another issue is that the time SlicStan takes to compile a program does not scale well with the size of the inference problem. PERPL, while supporting (unbounded) recursion, is designed to work in the absence of continuous r.v.s. For exact inference on certain recursive models involving only discrete r.v.s, PERPL does not empirically scale as well as the best known algorithms for the same models.
This paper presents a novel approach to VE for an expressive PPL. While acknowledging that it is an elusive, likely impossible goal for any single inference method to excel at all expressible programs, we aim to achieve good efficiency and scalability across a wide range of programs featuring bounded recursion, discrete distributions, and occasionally continuous distributions, all while providing provable correctness guarantees. We embody this approach in a PPL called Mappl.

VE as compilation.
In Mappl, a probabilistic computation is compiled into a pure computation of the marginal likelihood. A discrete r.v. is eliminated by summing (over the variable's finite support) the product of all factors dependent on that variable. Control flow, namely branching and function calls, is compiled in continuation-passing style: the compiled branch or function takes as input a continuation representing the product of all factors dependent on the return value.
Decomposition, memoization, and amortization. We observe that many recursive probabilistic programs of interest enjoy the property that their exponentially many possible executions share substructure. For these programs, the VE compilation effectively decomposes, in a recursive manner, a global marginal-inference problem into subproblems amenable to dynamic programming [9]. The same subproblem instance may be queried multiple times during inference, so the solution to the inference subproblem can be memoized and the cost of solving it thus amortized.
The subproblems are likely easier to solve than the global problem, because they have reduced dimensionality and are free of language constructs such as recursion that are difficult for inference. Some of these subproblems may be solved easily if they happen to contain no continuous r.v.s, some may be solved by existing approximate inference methods that specialize in straight-line programs with continuous r.v.s, and some may sometimes even be solved analytically by capitalizing on advances in symbolic integration. The upshot is that the VE compilation may render an otherwise intractable inference problem solvable in polynomial time.
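The amortization effect can be seen in a generic Python sketch (illustrative, not Mappl output): when recursive calls share subproblem instances, memoization collapses an exponential number of call paths into a linear number of distinct solves.

```python
from functools import lru_cache

calls = 0  # counts how many subproblem instances are actually solved

@lru_cache(maxsize=None)
def subproblem(n):
    # a stand-in for a recurring inference subproblem indexed by n
    global calls
    calls += 1
    if n < 2:
        return 1.0
    # the two recursive calls share substructure; memoization amortizes them
    return 0.5 * subproblem(n - 1) + 0.5 * subproblem(n - 2)

subproblem(30)
print(calls)  # 31 distinct instances, despite exponentially many call paths
```

Without the cache, the number of calls would grow exponentially in n; with it, each instance is solved once and reused.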
Factorization by information-flow typing. It is well understood that the effectiveness of VE critically depends on exploiting independence to factorize joint distributions. With recursion, there is even more reason for a VE compiler to exploit independence, as decomposition and memoization would not be as effective if subproblem definitions were too coarse-grained.
To reason about independence, we design an information-flow type system for Mappl. To eliminate a variable x from a computation, the Mappl compiler consults information-flow typing to factorize the computation into two parts, the probabilistic side effects of which are respectively dependent and independent of x. The idea of using information-flow typing [21] to reason about independence is similar to that in SlicStan [29], but Mappl's type-system design and formal development differ substantially from SlicStan's. SlicStan is an imperative while-language where variables must be global and programs must have deterministic support, whereas Mappl is an expressive functional language allowing recursion and stochastic support. SlicStan is defined with an operational semantics, whereas we adopt a compositional, denotational treatment suited for reasoning about independence, and thus for factorizing computations, for open terms under binders.

Generality and scalability.
Mappl generalizes VE compilation and information-flow typing to recursive probabilistic programs, for example, those expressing hidden Markov models (HMMs) and probabilistic context-free grammars (PCFGs). Experiments show that Mappl's VE compiler can generate code that recovers widely used polynomial-time inference algorithms: the forward algorithm for HMMs [55] and the inside algorithm for PCFGs [5].
We consider it important to achieve good scalability of not only inference but also compilation. Notably, compared with SlicStan, the increased generality of Mappl to support recursion has implications for the scalability of compilation. In SlicStan, HMMs have to be expressed by unrolling recursion into a fixed number of iterations. For such models, SlicStan's compilation time does not scale well with the problem size (e.g., the length of the observed sequence). In Mappl, by contrast, compilation time stays constant as the problem size increases for such models, because Mappl can express them as probabilistic recursive functions and compile them to pure recursive functions, without unrolling. In addition, SlicStan uses a semilattice (as opposed to a lattice) of information-flow labels for factorization, which is considered to impede efficient label inference. By contrast, Mappl uses a simpler, better-behaved two-level lattice.
Correctness guarantees. We want to show that Mappl's VE compilation is correct by proving that the compiled program computes the marginal likelihood as defined by the denotational semantics. Since compilation uses information-flow typing to factorize computations, we need to show that our information-flow type system is sound with respect to the denotational semantics. To that end, we contribute a logical-relations model of information-flow types for proving noninterference in the novel, measure-theoretic setting of probabilistic programming.

KEY FEATURES, MAIN IDEAS, AND EXAMPLES
We use a simple hidden Markov model as a starting point to illustrate the key features and the main ideas of our approach. Figure 1a models a sequence of observations as being generated by a sequence of hidden states. The recursive function hmm takes as input the initial hidden state z0 and a data sequence, which is assumed to be gathered by prepending newer observations to the front of the sequence. The return value of hmm is the next hidden state. The probability of transitioning from one state to the next is given by a pure function step : B → dist(B). The probability of observing a data point in a state is given by an emission function emit : B → dist(R). The HMM is conditioned on observing the data sequence.
The inference problem is, given a data sequence, to compute the marginal likelihood of observing it. The Mappl compiler translates the recursive, probabilistic hmm in Figure 1a into the recursive, pure hmm in Figure 1b. When the compiled hmm is called with the top-level continuation λ_. 0 for the parameter k, it computes the desired marginal likelihood. This procedure for exact inference runs in time linear in the length of data, recovering the forward algorithm for HMMs.
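The shape of the compiled hmm can be sketched in Python. This is an illustrative rendering, not the compiler's actual output: the transition and emission tables are assumed (with a discrete emission alphabet for simplicity), and each continuation is memoized so that it is evaluated at most once per Boolean argument, which is what keeps the run time linear.

```python
import math

def logsumexp_B(f):
    # log-domain sum over the Boolean support {0, 1}
    a, b = f(0), f(1)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def memo(k):
    # solve each continuation at most once per value of its argument
    cache = {}
    def k_memo(z):
        if z not in cache:
            cache[z] = k(z)
        return cache[z]
    return k_memo

# Assumed model tables (not from the paper): Boolean states, alphabet {a, b}.
log_step = {0: {0: math.log(0.9), 1: math.log(0.1)},
            1: {0: math.log(0.2), 1: math.log(0.8)}}
log_emit = {0: {"a": math.log(0.7), "b": math.log(0.3)},
            1: {"a": math.log(0.1), "b": math.log(0.9)}}

def hmm(k, z0, data):
    # data has the newest observation at the front, per the paper's convention
    if not data:
        return k(z0)
    x, xs = data[0], data[1:]
    # the continuation collects every factor depending on the recursive
    # call's return value z: the emission of x and the eliminated transition
    return hmm(memo(lambda z: log_emit[z][x]
                    + logsumexp_B(lambda y: log_step[z][y] + k(y))),
               z0, xs)

log_ml = hmm(lambda _: 0.0, 0, ["b", "a"])   # top-level continuation λ_. 0
print(math.exp(log_ml))                      # marginal likelihood of the data
```

For this two-observation sequence, the result agrees with summing over all hidden trajectories by hand: 0.7 · (0.9 · 0.3 + 0.1 · 0.9) = 0.252.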
Figure 1. Examples of VE-compiling probabilistic programs in Mappl. The return value of a multi-line block of terms is that of the last term. The construct ret(e) is the monadic return that lifts a pure expression to a probabilistic term. The primitive logPr(d; •) is the log-probability density or mass function of the distribution d. logsumexp_B : (B → R) → R and logsumexp_N : N → (N → R) → R are the usual log-sum-exp functions for log-domain sums. The choose primitive randomly selects a natural number in a given range, but unlike the sample primitive, choose does not otherwise incur any probabilistic side effects.
Handling expressive language features. The hmm example uses recursion, which violates the assumptions of many existing approaches to PPL inference. Instead of using recursion to define the HMM, one could perform exact inference by unfolding the model into a fixed number of iterations and then applying existing inference methods that work well on nonrecursive programs. But this approach is awkward when the number of iterations is not known statically, namely, when data is dynamically sized. A key design goal of Mappl is to support a broad class of models definable with bounded recursion where the bound may not be known statically.
Unrolling recursion is even more awkward for models that are not iterative but properly recursive and for models where control flow is stochastic. A prime example is PCFGs. Figure 1c shows a PCFG model in Mappl that samples parse trees for the simple grammar S → a (0.5) | S S (0.5). The program is conditioned on it generating a parse tree for a sequence of words. The recursion pattern of pcfg is more complex than hmm's. First, it is tree-structured rather than linear. Second, control flow is stochastic: which branch of case is taken depends on the sampled variable z in each recursive call. Yet, the Mappl compiler can still compile the program into a pure one that computes the marginal likelihood of observing words. The compilation applies to mutually recursive functions, too, which are useful for expressing more complex PCFGs in practice. The pure pcfg in Figure 1d recovers the cubic-time inside algorithm.
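The inside computation that the compiled pcfg performs can be sketched in Python for a grammar in the style of Figure 1c; the concrete grammar here, S → 'a' (0.5) | S S (0.5), is an assumption of this sketch, and memoization over spans (i, j) is what yields the polynomial running time.

```python
import math
from functools import lru_cache

def log_inside(words):
    # words: a tuple of terminals; inside(i, j) is the probability that the
    # start symbol S derives words[i:j], memoized over spans
    n = len(words)

    @lru_cache(maxsize=None)
    def inside(i, j):
        total = 0.0
        if j - i == 1 and words[i] == "a":
            total += 0.5                      # rule S -> 'a', weight 0.5
        for m in range(i + 1, j):             # rule S -> S S, split at m
            total += 0.5 * inside(i, m) * inside(m, j)
        return total

    return math.log(inside(0, n))

# P("a a") = 0.5 * 0.5 * 0.5: one binary split and two terminal rules
print(math.exp(log_inside(("a", "a"))))
```

There are O(n²) spans and O(n) splits per span, giving the cubic time of the inside algorithm.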
An information-flow type system. VE for Bayesian networks has a worst-case exponential running time, but tractable inference is possible for certain models by exploiting independence in the model structure: factors independent of a variable can be factored out of the summation that marginalizes out the variable.
Similarly, VE for recursive programs requires analyzing dependence and independence, for which we use a static information-flow analysis. In particular, we design an information-flow type system. Information-flow typing is a compositional, automatable means to reason about dependence, with applications in many different contexts [1], among which the most well-known is language-based security [60]. We repurpose the idea to our expressive PPL.
While information-flow typing for the pure fragment of Mappl is mostly standard, it is less obvious how to design typing rules for the probabilistic fragment. A principle is that the design of the type system should be guided by the denotational semantics. The denotational semantics of a probabilistic computation in Mappl is a measure over the space of its possible outcomes, which can be roughly thought of as a function that, given a set of values, produces the unnormalized probability that the computation returns a value in that set. Accordingly, our type system assigns to a probabilistic computation a labeled type A^ℓ: the type A types the return value of the computation, and the label ℓ classifies the level of information contained in the measure denoting the computation.

  ∆; Ψ; Ξ; C; Γ ⊢ t1 : A1^ℓ1    ∆; Ψ; Ξ; C; Γ, x : A1^ℓ2 ⊢ t2 : A2^ℓ3
  ---------------------------------------------------------------
  ∆; Ψ; Ξ; C; Γ ⊢ x = t1; t2 : A2^(ℓ1 ⊔ ℓ3)

For example, consider typing variable bindings in the probabilistic fragment. As expected, the rule requires the composed computation x = t1; t2 to have a label no lower than the labels of the computations t1 and t2 being composed. However, it does not explicitly constrain the label ℓ2 of the variable x being bound. In particular, it is not required that the label of x be at least as high as the label of t1. This is in keeping with the denotation of x = t1; t2, which is defined by composing the measures that denote t1 and t2 and marginalizing over the entire support of x.
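The label arithmetic of the binding rule can be rendered as a toy Python sketch of the two-level lattice; this mirrors the rule's conclusion, not the Mappl implementation.

```python
# Two-level lattice {L, H} with L ⊑ H; join (⊔) picks the higher label.
L, H = "L", "H"

def join(l1, l2):
    return H if H in (l1, l2) else L

def bind_label(label_t1, label_t2):
    # x = t1; t2 is labeled ℓ1 ⊔ ℓ3, independently of the label given to x
    return join(label_t1, label_t2)

assert bind_label(H, L) == H   # a high side effect taints the composition
assert bind_label(L, L) == L   # two low computations compose to low
```

Notably, the label of the bound variable x does not appear in bind_label at all, matching the rule's refusal to constrain ℓ2.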
Factorizing computations using information-flow typing. To eliminate a variable from a probabilistic computation, Mappl's VE compiler infers labels for the subcomputations, constraining that the variable being eliminated be labeled H (high). As many subcomputations as possible are inferred to be labeled L (low). The larger computation can thus be factorized into an H partition and an L partition, and the L partition need not be involved in the elimination of the variable.
For example, in Figure 1a, to eliminate sample(step(z)) from the cons branch, the sampled variable is labeled H, and the information-flow analysis deduces that the probabilistic side effects of the recursive call hmm(z0, xs) and the conditioning observe(emit(z); x) can be labeled L, while the side effects of sample(step(z)) and any computations in the caller that depend on the return value z must be labeled H. This factorization indicates that in the compiled program, only the H partition needs to be nested under the logsumexp_B that marginalizes out sample(step(z)).
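The payoff of hoisting the L partition is a simple log-domain identity: a term independent of the eliminated variable can be pulled out of the log-sum-exp. The following numeric check (with arbitrary, assumed values) illustrates it.

```python
import math

def logsumexp_B(f):
    # log-domain sum over the Boolean support {0, 1}
    a, b = f(0), f(1)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

f_H = {0: math.log(0.4), 1: math.log(0.6)}   # H factors: depend on y
c_L = math.log(2.5)                          # L factor: independent of y

nested  = logsumexp_B(lambda y: f_H[y] + c_L)   # L factor under the sum
hoisted = c_L + logsumexp_B(lambda y: f_H[y])   # L factor hoisted out

assert abs(nested - hoisted) < 1e-12
```

In probability space this is just Σ_y e^{f(y)} · c = c · Σ_y e^{f(y)}; the factorization leaves the result unchanged while shrinking the work done under the sum.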
Compiling with continuations. Continuation-passing style (CPS) transformations [19] are an effective way to compile various forms of effects away from a program. CPS has found applications in the implementation of PPLs [28] and in the cost analysis of randomized algorithms [4, 35]. The Mappl compiler uses CPS to capture dependence in the presence of function calls and branching.
For example, in the compiled hmm, a continuation of type B → R represents the dependencies of the recursive call's return value z: the continuation takes z as input and returns a log-likelihood that is the transformation of those caller terms whose probabilistic side effects depend on z. In the compiled pcfg, the continuations passed to recursive calls are less involved, as the information-flow analysis deduces that the return value has no nontrivial dependencies.
Decomposition into subproblems with memoization. The compiled hmm and pcfg run in polynomial time, despite the exponentially many possible executions of the input programs. This algorithmic efficiency is because the recursive programs have dynamically recurring substructure, on which the VE compilation capitalizes to generate recurring subproblems and memoize their solutions. For example, the compiled hmm corresponds to the following recursive equations for computing, in log domain, the marginal likelihood L(k, z0, data) of observing data:

  L(k, z0, nil) = k(z0)
  L(k, z0, x :: xs) = L(λz. logPr(emit(z); x) + logsumexp_B(λy. logPr(step(z); y) + k(y)), z0, xs)

The generated inference subproblem eliminates the discrete r.v. sample(step(z)), denoted by y, by summing over its finite support. Although the subproblem is nested inside a continuation, inference time is linear in the length of data: the subproblem needs to be solved at most once for each of the two possible values of z and for each continuation k created, and there can only be as many continuations as the length of data. The solution to a subproblem instance, once computed, can be memoized and reused whenever the same subproblem instance is encountered again. This decomposition and memoization is what recovers the dynamic-programming algorithms for HMMs and PCFGs [55, 5]. Contrast this with solving the global problem directly (say, with an enumeration-based approach to exact inference) without first compiling the recursive program to decompose it into subproblems:

  Σ_{z1 ∈ B} ··· Σ_{zn ∈ B}  Π_{i=0}^{n−1}  Pr(emit(z_i); x_{i+1}) · Pr(step(z_i); z_{i+1})

The sums over the Boolean variables z1, z2, ..., zn enumerate all possible execution traces of the program. Hence, the inner product has to be computed O(2^n) times, where n is the length of data.
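The enumeration can be written directly in Python for a tiny instance; the model tables below are assumed, and the trace count is 2^n, which is why this approach is only usable for very short sequences.

```python
import itertools

# Assumed tables (illustrative): Boolean states, emission alphabet {a, b}.
step = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
emit = {0: {"a": 0.7, "b": 0.3}, 1: {"a": 0.1, "b": 0.9}}

def brute_force(z0, obs):
    # obs in time order x1, ..., xn; sum over all 2^n traces (z1, ..., zn)
    n = len(obs)
    total = 0.0
    for zs in itertools.product((0, 1), repeat=n):
        zs = (z0,) + zs
        p = 1.0
        for i in range(n):
            # Pr(emit(z_i); x_{i+1}) * Pr(step(z_i); z_{i+1})
            p *= emit[zs[i]][obs[i]] * step[zs[i]][zs[i + 1]]
        total += p
    return total

print(brute_force(0, ["a", "b"]))  # enumerates 4 traces for n = 2
```

For n = 2 this sums 0.7 · (0.9 · 0.3 + 0.1 · 0.8 + 0.1 · 0.9 + ...) over the four trajectories and collapses to 0.7 · (0.9 · 0.3 + 0.1 · 0.9) = 0.252, the same value the linear-time compiled recursion produces.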
Continuous parameters. Consider hmm′ in Figure 1e, a hybrid discrete-continuous HMM. It is largely the same as hmm except that the emission function takes as input an additional, freshly sampled Gaussian variable w. Directly solving the marginal-inference problem using a general-purpose inference method such as importance sampling would be intractable. Instead, Figure 1f shows that hmm′ is compiled similarly to hmm, with the emission factor replaced by a nested inference problem marginalizing out w. Marginal inference with the compiled hmm′ is efficient, even when the inference subproblem logML(...) is solved using general-purpose Monte Carlo methods. The key is that with memoization, this subproblem generated by Mappl's compiler needs to be solved only O(n) times, once for each of the two values of z and for each of the at most n values of x.

A semantic model of information-flow types. Factoring out independent factors is akin to loop-invariant code motion, a compiler optimization that moves code outside a loop if it is independent of the loop index. The dependence analysis VE entails is more sophisticated, though, due to the measure-theoretic nature of the semantics of probabilistic programs. Fortunately, information-flow typing provides a syntactic, principled means to reason about independence.
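Returning to the hybrid HMM hmm′ above, the nested subproblem logML(w = sample(Normal(0,1)); factor(logPr(emit′(w, z); x))) can be approximated by plain Monte Carlo and memoized per (z, x) pair. The sketch below is illustrative: the emission density is a hypothetical stand-in for emit′, not the paper's definition.

```python
import math
import random
from functools import lru_cache

def log_emit_density(w, z, x):
    # hypothetical emission emit'(w, z): x ~ Normal(w + (1 if z else -1), 1)
    mu = w + (1.0 if z else -1.0)
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

@lru_cache(maxsize=None)
def log_ml_sub(z, x, n=10_000):
    # Monte Carlo estimate of the nested logML, with the prior N(0, 1)
    # as the proposal; seeded for reproducibility of the sketch
    rng = random.Random(0)
    acc = [log_emit_density(rng.gauss(0.0, 1.0), z, x) for _ in range(n)]
    m = max(acc)
    return m + math.log(sum(math.exp(a - m) for a in acc) / n)

log_ml_sub(0, 1.3)   # solved once by Monte Carlo ...
log_ml_sub(0, 1.3)   # ... and served from the cache thereafter
```

Under this stand-in emission, the subproblem has a closed form (x marginally follows Normal(±1, 2)), so the Monte Carlo estimate can be sanity-checked against log N(1.3; −1, 2) ≈ −2.59 for z = 0.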
How can we argue that the syntactic approach of information-flow typing to reasoning about independence is semantically sound in this probabilistic setting? For an information-flow type system, the usual notion of soundness is noninterference [27]. In our novel setting, we take noninterference to mean that the measure denoting a probabilistic computation of a type labeled low behaves irrespective of substitutions for its high-labeled free variables. To prove noninterference, we adapt the semantic, logical-relations proof technique [61, 1], interpreting labeled types as partial equivalence relations on measures indistinguishable to an observer. We believe that our semantic model and its metatheory are the first to introduce observer-sensitive equational reasoning to a measure-theoretic setting and are thus of independent interest.

Mappl has a pure, deterministic fragment and a monadic, probabilistic fragment, similar to some prior PPL formalisms [67, 37, 36]. Pure computations take the form of expressions. The pure fragment is a simply typed λ-calculus equipped with real numbers, pairs, sums, iso-recursive types, n-ary operations, and two primitive distributions (Bernoulli and normal, representative of discrete and continuous distributions). A special binary operation logPr(d; v) gives the log-probability density or mass of a distribution d at a point v.

Syntax.
Recursive types, sum types, and product types together enable the expression of algebraic data types, including Booleans and lists. Distributions have type dist(A), where A = B for a Bernoulli distribution and A = R for a normal distribution.
Probabilistic computations come in the forms of terms and commands. A command sequences terms. The return value of a command is that of its last term. We will write a term t for the singleton command consisting of t when it is clear from context that t is being used as the last term of a command. Terms have the following forms: ret(e) returns the value of a pure expression e, sample(d) samples from a distribution d, factor(e) conditions the program using a log-domain expression e, case(e; x. t1; x. t2) branches on a sum-typed expression e, and f(e) invokes a global function f with arguments e. The factor form supports soft constraints, which subsume conditioning on continuous observations; that is, observe(d; e) can be encoded as factor(logPr(d; e)). For brevity of presentation, we omit hard constraints (i.e., factor(−∞)), but they are straightforward to incorporate in both the syntax and the semantics. We sometimes write sample_A(d) and logPr_A(d; v), where A is B or R, to indicate the type of the support of the distribution d.
The pure fragment supports nested marginal inference via the form logML(m). While the command m is probabilistic, logML(m) is a pure expression, since inference handles the probabilistic effects of m. It returns the log-marginal likelihood of the probabilistic computation m.
A program in Mappl is composed of a set of global definitions and a main command. The global definitions can be either pure (G) or probabilistic (F). Importantly, Mappl allows mutual recursion among pure globals and among probabilistic globals.
Recursion leaves open the possibility of nontermination, but for VE in this paper, we do not concern ourselves with programs that may fail to terminate.
Type system. The base type system for Mappl is standard. Figure 2 shows selected rules. Expression typing judgments have the form ∆; Γ ⊢ e : A, where ∆ is a context mapping names of pure global definitions to their types, and Γ is a context mapping local variables to their types. Term and command typing judgments have the forms ∆; Ψ; Γ ⊢ t : A and ∆; Ψ; Γ ⊢ m : A. Computations in the probabilistic fragment can use probabilistic globals, whose types are provided by the context Ψ.
Quasi-Borel spaces (qbses) serve as a drop-in replacement for measurable spaces. They enable carrying out measure theory in the presence of higher-order types and recursive types, by providing well-behaved function spaces and cpos. Thus, in this paper, we use standard measure-theory notations with the understanding that we are working with qbses. For a measurable function f : X → R+, the Lebesgue integral of f with respect to a measure μ on X is denoted ∫ f dμ.

The semantic interpretation ⟦A⟧ of each type A is a qbs. Types in Mappl are similar to those in the SFPC calculus of Vákár et al. [64], which has function types and iso-recursive types; we refer the reader to their paper for detailed constructions of the qbses. For a qbs X, we write X⊥ for the lifting of X to another qbs with an extra element ⊥ signifying partiality. There is a commutative strong monad of measures on qbses [64]; for a qbs X, we write Meas X for the qbs of measures on X. Figure 3 shows selected interpretations of expressions, terms, and commands. The definitions use the operator >>= to handle partiality. Here, 1_S(v) is the indicator function that is 1 if v ∈ S and 0 otherwise.
The contract for interpreting expressions is that an expression of type A is interpreted as an element of ⟦A⟧⊥. The denotation ⟦e⟧_{δ; γ} of an expression e typed in contexts ∆ and Γ is interpreted under substitutions δ ∈ ⟦∆⟧ and γ ∈ ⟦Γ⟧ for the bindings in ∆ and Γ. That is, δ and γ provide semantic interpretations for the bound global and local variables. The denotation of a primitive distribution is its probability density or mass function.
The contract for interpreting terms and commands is that a term or command of type A is interpreted as a measure on ⟦A⟧⊥. In particular, logML(m) is interpreted as the logarithm of the total mass of the measure denoting m.

INFORMATION-FLOW TYPE SYSTEM
Syntax. Figure 4 shows the syntax of labels, types, and contexts of the information-flow type system. The type system is parameterized by a join semilattice L of labels, although in this work, we will need only the two-level lattice {L, H} with L ⊑ H. Types are of the form A^ℓ, where A is an unlabeled type that is further constructed from labeled types. Metavariables τ range over (labeled) types. We differentiate them from types in the base type system by typesetting labeled types in upright font and in purple. Similarly, we typeset the metavariables ∆, Ψ, and Γ of the information-flow type system differently than those of the base type system, as they now contain labeled types.
The type system supports label polymorphism, as well as ordering constraints on labels, for functions; that is, the type of a function can be parameterized by label variables and by ordering constraints of the form ℓ1 ⊑ ℓ2. Label polymorphism allows more reusable code [45, 44].
Typing the pure fragment. Typing judgments for expressions have the form ∆; Ξ; C; Γ ⊢ e : τ, where Ξ records the label variables in scope and C the label ordering constraints. These rules are largely standard. In particular, introduction rules (e.g., those for unit, lambdas, pairs, and distributions) do not constrain the label of the expression; the label can be arbitrarily low. A subsumption rule allows weakening (i.e., increasing) the label of an expression. Subtyping rules are standard and thus omitted. For a labeled type A^ℓ, subtyping ≤ is covariant in both A and ℓ.
The type system uses fine-grained labeling, in that every type, top-level or nested, is labeled [56]. For example, the type of a pair takes the form (A1^ℓ1 × A2^ℓ2)^ℓ3, where the inner labels ℓ1 and ℓ2 classify the contents of the pair, while the top-level label ℓ3 classifies the reference to the pair. The distinction enables fine-grained control over the flow of information.
The introduction rules for primitive distributions use the type dist(A^ℓ1)^ℓ2, where A is either B or R, ℓ1 classifies the contents of the distribution (i.e., how the r.v. is distributed), and ℓ2 classifies the reference to the distribution. For instance, given x : B^H, the expression Bern(case(x; _. 0.7; _. 0.1)) can be typed at dist(B^H)^L. This fine-grained labeling allows the distribution to be stored in a data structure that can hold only L references, while ensuring that when the distribution is eventually retrieved and sampled, the probabilistic effects are classified at H. Whereas coarser-grained type systems trade fine-grained control for a reduced label-annotation burden, label annotation is not a concern in our setting, because programmers do not specify security policies through labels as they would in a security-typed language. Instead, labels are automatically inferred.
Typing the probabilistic fragment. The design of the type system is guided by the denotational semantics: while the label of an expression e is designed to classify the information the semantic value ⟦e⟧ contains, the label of a term t or a command m should classify the information the measure ⟦t⟧ or ⟦m⟧ contains. Consider typing sample(d), where d has type dist(A^ℓ1)^ℓ2. Since the contents of the distribution d, as well as the identity of d itself, determine the measure denoting sample(d), the term should be typed at a level no lower than ℓ1 ⊔ ℓ2. In Figure 4, the typing rule for sample(d) handles distributions that can be typed at dist(A^ℓ)^ℓ. This rule suffices, as the type dist(A^ℓ1)^ℓ2 is covariant in both ℓ1 and ℓ2: both labels can be weakened to a label ℓ ⊒ ℓ1 ⊔ ℓ2 by subsumption. Consider typing case(e; x. t1; x. t2), where e has type (τ1 + τ2)^ℓ. Information flows, via a control structure, from e to the measure over the possible outcomes of evaluating the term. So the term should be classified at a level no lower than ℓ.
Typing a call to a probabilistic global function checks, with a judgment of the form Ξ; C ⊢ C′{ℓ/ξ}, that the constraints C′ specified in the function's type (after substituting the label arguments ℓ for the label variables ξ) are satisfied under the current context. Now consider typing x = t1; t2. Since the denotation is defined by composing the two measures ⟦t1⟧ and ⟦t2⟧, the label of x = t1; t2 is required to be no lower than the labels of t1 and t2. It is perhaps surprising that in the typing rule, the label ℓ′ of x is not required to be at least as high as the label ℓ of t1. An explanation is that denotationally, t1 merely defines a measure on the possible values x can take; it does not determine the value of x in any given run.
This wrinkle has implications for the precision of the information-flow analysis. Consider the example below. The unnormalized joint density of x and y consists of three factors: φ1(x) φ2(x, y) φ3(y). Suppose that we want to marginalize out x. Labeling x at H, we hope to type the third term at L, to justify that it need not be involved in the marginalization of x: Σ_x φ1(x) φ2(x, y) φ3(y) = φ3(y) Σ_x φ1(x) φ2(x, y). The typing rule for x = t1; t2 allows y to be labeled L, despite that its right-hand-side term needs to be typed at H. In contrast, if the typing rule required y to be labeled H, then the third term would also have to be typed at H, which would disallow factoring the third term out of the sum.
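The factorization identity motivating this precision can be checked numerically with arbitrary, assumed factor tables:

```python
# Σ_x φ1(x) φ2(x, y) φ3(y) = φ3(y) Σ_x φ1(x) φ2(x, y), for every y.
phi1 = {0: 0.3, 1: 0.7}
phi2 = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4}
phi3 = {0: 0.9, 1: 0.5}

for y in (0, 1):
    lhs = sum(phi1[x] * phi2[(x, y)] * phi3[y] for x in (0, 1))
    rhs = phi3[y] * sum(phi1[x] * phi2[(x, y)] for x in (0, 1))
    assert abs(lhs - rhs) < 1e-12   # φ3 safely factors out of the sum over x
```

Because φ3 depends only on y, it is constant with respect to the summation index x, which is exactly the property the L label certifies.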
Finally, the expression logML(m) is typed at a label no lower than the label of the command m, since the measure denoting m determines the model evidence of the probabilistic computation m.
Remarks. In the presence of side effects such as mutable state, information-flow type systems often use a program-counter label [22] to lower-bound information leaked through side effects. Indeed, in the imperative while-language of SlicStan [29], typing judgments of commands do use a label to lower-bound the write effects of commands. In Mappl, however, the label of a command is an upper bound that directly classifies the information that can flow into the command's measure denotation, just as an expression's label upper-bounds the information that can flow into the expression's denotation. In SlicStan, by contrast, factorization must first produce an upper bound by joining the labels of its subexpressions.

NONINTERFERENCE VIA A LOGICAL-RELATIONS MODEL
Semantic types. We now establish the soundness of the information-flow type system, by constructing a semantic model of the types. Figure 5 defines our logical-relations model.
The definition uses a function ⌊•⌋ that strips a type of all labels occurring in it; ⌊•⌋ sends a type in the information-flow type system (Section 4) to a type in the base type system (Section 3). The function is overloaded on labeled types τ, unlabeled types A, and contexts ∆, Ψ, Γ.
The main idea behind our model is to interpret each type τ as two binary relations: a value relation V⟦τ⟧_O for semantic typing of pure expressions, and a measure relation M⟦τ⟧_O for semantic typing of probabilistic computations. Both relations are parameterized by a label O that stands for the "security clearance" of an observer, and by a substitution θ for the label variables occurring free in τ. The model is constructed by first defining, by induction on types, interpretations parameterized by a semantic substitution ρ for free type variables, with fixpoints taken in the case of recursive types. The relations V⟦A⟧_O, V⟦τ⟧_O, and M⟦τ⟧_O are then defined on those types without free type variables.
The value relation V⟦σ⟧_O relates two semantic values in the semantic domain ⌊σ⌋ if they are indistinguishable to the observer. What is considered indistinguishable is determined by the observer's label O and the type σ. At a ground type U or R, only identical values are related. At a function type, two functions are related if they send related inputs to related outputs. The relations at product, sum, and recursive types are standard as well. The relation at a distribution type dist(σ^ℓ) relates two density functions. Having defined the value relation at unlabeled types σ, we can define the relation V⟦σ^ℓ⟧_O at a labeled type σ^ℓ, which depends on the labels ℓ and O. If ℓ ⊑ O (after applying the substitution to the label ℓ), the observer is cleared to see values at label ℓ, so V⟦σ^ℓ⟧_O contains exactly those values related by V⟦σ⟧_O. Otherwise, ℓ ̸⊑ O and the observer does not have the clearance to see values at ℓ, so V⟦σ^ℓ⟧_O is the full relation on ⌊σ⌋.

The measure relation M⟦σ^ℓ⟧_O relates two measures in the semantic domain Meas ⌊σ⌋ if they are indistinguishable to the observer. Recall that for a probabilistic computation, its label ℓ represents the information contained in the measure denoting it. Hence there are two cases, depending on whether ℓ ⊑ O.

Semantic typing. To define semantic typing, we first define the semantic interpretation of contexts. In particular, two substitutions γ1 and γ2 are related at a context Γ when, for every variable in the domain of Γ, γ1 and γ2 map it to related values. Similarly for the interpretations of ∆ and Ψ.
We prove the fundamental property of logical relations, which states that syntactically well-typed expressions, terms, and commands are semantically well-typed. The proof is by induction on the (syntactic) typing derivations, with each case proving that semantic typing is compatible with some syntactic typing rule. As with a typical logical-relations proof, the challenge lies in setting up the logical-relations model (think of the relations as induction hypotheses); the proof itself is routine. Of special note is that the proof involves showing that two integrals are equal even when the integrands and the measures are not equal point-wise. Lemma 5.2 is a convenient result [18] that enables proving equivalence using a coarser structure than point-wise equality.

Lemma 5.2 (Coarsening). Let (X, Σ_X) be a measurable space, μ1, μ2 be measures on X, and f1, f2 : X → R+ be measurable functions. Let R ⊆ Σ_X × Σ_X be a binary relation on measurable sets. If (1) μ1 and μ2 agree on R-related sets, i.e., μ1(A1) = μ2(A2) for all (A1, A2) ∈ R, and (2) f1 and f2 have R-related preimages, i.e., (f1⁻¹(B), f2⁻¹(B)) ∈ R for all B ∈ Σ_R, then

∫ f1 dμ1 = ∫ f2 dμ2.
The lemma allows proving integrals equal by picking a suitable relation on measurable sets; the relations E⊥⟦·⟧_O on measurable sets fit the bill.
Noninterference. Noninterference follows from the fundamental property. It guarantees that the measure denoting an L-typed term behaves irrespective of the H-labeled variables in the context.

Theorem 5.3 (Noninterference).
Let Γ_H be a context that binds only H-labeled variables, that is, for all x ∈ dom(Γ_H), there is some σ such that Γ_H(x) = σ^H. Let f : ⌊τ⌋⊥ → R+ be a measurable function. If ⊢ τ : L, ⊢ Γ_L : L, and ∆; Ψ; ∅; ∅; Γ_H, Γ_L ⊢ t : τ, then for all γ1, γ2 ∈ ⌊Γ_H⌋ and γ ∈ ⌊Γ_L⌋,

∫ f d⟦t⟧(γ1, γ) = ∫ f d⟦t⟧(γ2, γ).

Here, ⊢ τ : L is defined to mean that all labels occurring in τ, including nested labels, are L. And ⊢ Γ : L means that ⊢ Γ(z) : L for all z ∈ dom(Γ). It is not a sufficient condition that only the outermost label of τ and those in Γ are L. For example, the typing ∆; Ψ; ∅; ∅; x : R^H ⊢ ret(⟨x, x⟩) : (R^H × R^H)^L is valid, but it would be absurd if it implied that ⟨x, x⟩ behaved irrespective of x. Similarly, x : R^H, y : (R^H → R^L)^L ⊢ ret(y x) : R^L is a valid typing, but it would be absurd if it implied that the application of y to x behaved irrespective of x for a γ such that γ(y) ∈ ⌊(R^H → R^L)^L⌋. Theorem 5.3 is a corollary of a more general version of noninterference (given in an appendix [39]) that relaxes the condition ⊢ Γ_L : L and allows the integrands on the two sides of the equation to be different.

VARIABLE-ELIMINATION TRANSFORMATION
Main idea. To generate a pure program, the transformation must compile away all probabilistic-fragment constructs. In particular, (1) it eliminates random variables (r.v.s) by summation or integration, creating inference subproblems as a result, and (2) it compiles stochastic control flow to deterministic control flow in continuation-passing style. In both cases, the transformation uses information-flow typing to factorize a command into an H partition and an L partition. The measure denotation of the L partition is guaranteed to be independent of the H-labeled variable, be it bound to a sample, case, or call term, being eliminated. So the L partition can be factored out of the summation, integration, or continuation indexed by the H-labeled variable.
Program transformation. Figure 6 formalizes the VE transformation as a set of mutually recursive translation functions. The translation is defined for well-typed programs (with respect to the base type system in Section 3), so it additionally takes typing contexts as input, but for brevity, we omit them in Figure 6; a version with complete context information can be found in an appendix. As a running example, the step-by-step translation of hmm is shown in Figure 7. We now describe the translation rules in Figure 6, referring to steps in Figure 7 as concrete instantiations of the rules.
T⟦F⟧ translates a probabilistic function F to a pure function additionally parameterized by a continuation k, with the body of F translated by K using k (step 1). K translates a command given a continuation, which is the log-factor dependent on the return value of the command. At the top level, the main command is translated with the top-level continuation λx. 0. K works by applying the continuation to the command's return value, appending the resulting factor to the command, and then using D+ to further translate the resulting U-typed command (steps 2, 7, and 11).
While K can be applied to any well-typed command, the remaining translations (D+, D−, C, and R) are defined only on commands of the unit type U, as those commands will already have been CPS-translated. D+ does preparatory work for eliminating discrete r.v.s; the real work is done by D−. D+ accumulates bindings of discrete r.v.s into a worklist, turning all sample_B terms into factor terms (step 12), so that D− can eliminate the discrete r.v.s in the worklist in one go, without having to worry about factorization possibly creating unbound references to the variables.
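The sample_B-to-factor conversion and the subsequent summation can be pictured with a small sketch (our own schematic, not Mappl syntax): a binding y = sample_B(Bernoulli(p)) leaves behind a deferred log-factor log Pr(Bernoulli(p); y), and D− later sums the accumulated factors over both values of y.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def bern_log_pr(p, y):
    # log Pr(Bernoulli(p); y): the factor left behind for a worklist variable y
    return math.log(p if y else 1.0 - p)

def eliminate_bool(log_factor):
    """Eliminate a Boolean worklist variable: logsumexp over both values
    of the total log-factor in which it occurs."""
    return logsumexp([log_factor(True), log_factor(False)])
```

For instance, eliminating y from a prior Bernoulli(0.3) with a likelihood factor of 0.9 (if y) or 0.2 (if not) yields log(0.3 · 0.9 + 0.7 · 0.2).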
D− eliminates from the command the discrete r.v.s stored in the worklist z, one at a time, until the worklist is empty (step 14). Elimination of a variable y from a command involves factorizing the command into an H partition and an L partition, using the ⇝ judgment to be defined shortly. D− eliminates y by summing over y the factor contributed by the H partition. Importantly, the L partition can be left out of this sum, because the measure denotation of the L partition is guaranteed to be independent of the H-labeled y. The formalized translation does not make memoization explicit, but solutions to this sum may be memoized, with the memo table indexed by the free (discrete) variables in the sum. These free variables can be thought of as the Markov blanket [49] of y, conditioned on which all other variables are uncorrelated with y. The order in which the variables are eliminated is left unspecified; it is a well-studied, orthogonal problem. It is NP-hard to find the optimal ordering, whose elimination width equals the tree width [20]. In practice, heuristics (e.g., eliminating variables with fewer neighbors first) are effective in giving good orderings with low elimination widths.
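The min-neighbors heuristic mentioned above can be sketched as a standard greedy ordering on an undirected interaction graph (the graph representation and names here are ours, for illustration):

```python
def min_neighbors_order(graph):
    """Greedy elimination ordering: repeatedly eliminate the variable with the
    fewest neighbors, connecting its neighbors (fill-in edges) as it goes.
    graph maps each variable to the set of variables it shares a factor with."""
    g = {v: set(ns) for v, ns in graph.items()}
    order = []
    while g:
        v = min(g, key=lambda u: (len(g[u]), u))  # tie-break on name for determinism
        neighbors = g.pop(v)
        for a in neighbors:
            g[a].discard(v)
            g[a] |= neighbors - {a}  # fill-in: connect v's remaining neighbors
        order.append(v)
    return order
```

On a chain a–b–c–d, the heuristic eliminates from the low-degree endpoints inward, which keeps every intermediate factor small.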
C eliminates probabilistic-level control flow, namely call and branching terms. To translate a function call y = f(e), the variable y is labeled H, and the continuation to the function call is factorized into an H partition and an L partition. The factor representing the probabilistic effects of the H partition, as a function of y, is passed to the translated pure function f as the continuation argument. The pure function call returns an R-valued factor, which is then appended to the command to be translated further (step 16). Notice that the measure denotation of the L partition is independent of the return value y, so the L partition can be left out of the continuation passed to the CPS-translated call to f. C translates a branching term y = case(e; x. t1; x. t2) in a similar manner: it factorizes the probabilistic continuation into two partitions, constructs a pure continuation using the H partition only, and CPS-translates the branches by passing this pure continuation (step 5).
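To make the effect of the CPS translation concrete, here is a hand-written Python sketch of what a VE-compiled recursive HMM looks like: the probabilistic function becomes a pure one, the next hidden state is eliminated by summation at each step, and memoizing on the eliminated variable's Markov blanket (current state and position) yields linear-time inference. All parameters are made up for illustration, and the trivial top-level continuation (log-factor 0) is inlined.

```python
import math
from functools import lru_cache

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical two-state HMM parameters (log probabilities), illustration only.
TRANS = [[math.log(0.7), math.log(0.3)],
         [math.log(0.4), math.log(0.6)]]
EMIT = [[math.log(0.9), math.log(0.1)],
        [math.log(0.2), math.log(0.8)]]
OBS = (0, 1, 1, 0, 1)

@lru_cache(maxsize=None)
def hmm(z0, t):
    """Pure, VE-compiled recursion: log-weight of emitting OBS[t:] from state z0.
    Without the cache this recursion is exponential; with it, O(len(OBS))."""
    if t == len(OBS):
        return 0.0
    return logsumexp([TRANS[z0][z1] + EMIT[z1][OBS[t]] + hmm(z1, t + 1)
                      for z1 in (0, 1)])
```

The memo table indexed by (z0, t) plays exactly the role of the Markov blanket described above: conditioned on it, the remainder of the computation is independent of the eliminated states.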
When no control-flow terms remain to be eliminated, R takes over to eliminate any remaining probabilistic-level terms, namely sample_R and factor. A sample_R term is eliminated by integration (i.e., applying logML), which can invoke any marginal-inference method of choice. Solutions to this integration may be memoized, with the memo table indexed by the free (discrete) variables in the integral. Unlike sample_B terms, sample_R terms are not first converted to factor terms before being eliminated, as integration in general requires sampling from the prior distributions of continuous r.v.s.
Like in D− and C, the probabilistic continuation is factorized in R to allow irrelevant terms to be left out of the integration. Unlike in D− or C, continuous r.v.s are not always eliminated one at a time but, when possible, simultaneously, to avoid creating unnecessary nested integrals. For instance, the command

c ≜ x = sample(Normal(0, 1)); y = sample(Normal(3, 1)); factor(logPr(Normal(x² + y², 1); 6))

is translated by R into factor(logML(c)), i.e., logML(c). Factorization makes sure that all three terms belong to the same H partition, even though information-flow typing does not demand so, so that x and y can be marginalized out simultaneously.
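For this example, the generated subproblem is a single two-dimensional integral, which a simple Monte Carlo method can approximate by sampling x and y from their priors and averaging the likelihood. The sketch below is our own illustration of such a subproblem solver, not Mappl output:

```python
import math
import random

def log_normal_pdf(v, mu, sigma):
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_ml_estimate(n=100_000, seed=0):
    """Monte Carlo estimate of the log model evidence for
    x ~ Normal(0,1); y ~ Normal(3,1); factor(logPr(Normal(x^2+y^2, 1); 6))."""
    rng = random.Random(seed)
    log_ws = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(3.0, 1.0)
        log_ws.append(log_normal_pdf(6.0, x * x + y * y, 1.0))
    m = max(log_ws)
    return m + math.log(sum(math.exp(lw - m) for lw in log_ws) / n)
```

Because the two continuous variables are marginalized in one integral rather than nested ones, a single flat sampling loop suffices.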

Command factorization via information-flow typing. The ⇝ judgment factorizes a command into an H partition c_H and an L partition c_L, producing a context Γ′ of the variable bindings declared in c_L and available for use in c_H, such that ∆; Ψ; ∅; ∅; Γ_L ⊢ c_L : U^L and ∆; Ψ; ∅; ∅; Γ_H, Γ_L, Γ′ ⊢ c_H : U^H. In broad strokes, factorization works by partitioning all H-labeled terms in the command into c_H and all L-labeled terms into c_L (steps 5, 14, and 16). Label inference is implemented by solving unification constraints.
Step 14 performs factorization to eliminate y. With y labeled H, the two factor terms must be typed at H, while the other two terms can be typed at L and left out of the sum over y. Step 16 performs factorization to eliminate the recursive call z = hmm(z0, xs). With z labeled H, both terms in the continuation must be typed at H, so their translation is passed to the pure hmm as the continuation argument. Step 5 performs factorization to eliminate y = case data ..., which is similar.
Because noninterference guarantees that an L-labeled term behaves irrespective of the value of the H-labeled variables in Γ_H, factorization additionally canonicalizes an L-labeled term: it replaces the H-labeled variables with default values. This substitution ensures that c_L does not refer to variables bound in Γ_H and is thus well-typed under the context Γ_L. Finally, factorization treats sample_R terms specially: it partitions x = sample_R(e) into c_L only if x is not needed in c_H, thus scoping the integral over x to one partition only and avoiding unnecessary nested integrals.

Lemma 6.1, whose full statement involves a context Γ_H binding only H-labeled variables and a context Γ_L typed at L, assures that factorizing a command produces two partitions that together preserve the semantics of the command. The notation for the multiple integral used in its statement, defined in an appendix along with the proof, is a shorthand for integrating with respect to the measures each denoting a binding x = t in c_L. The lemma is a consequence of noninterference (Theorem 5.3).
Correctness. We prove that the variable-elimination transformation is correct. The theorem states that the transformed pure expression, when it terminates, computes the log model evidence of the original probabilistic program. As expected, the proof depends on Lemma 6.1.

EXPERIMENTAL EVALUATION
Scalability of VE compilation. We compare Mappl and SlicStan [29]. Stan [13], while a popular PPL, does not support discrete parameters. In response, SlicStan features a state-of-the-art VE compiler that performs information-flow analysis and emits variable-eliminated Stan code. As a benchmark, we consider a simple HMM, for which both compilers can generate code whose running time scales linearly with the length of the observed sequence. Compilation time, however, differs. Figure 8 shows how compilation time scales as the size of the inference problem increases. (All experiments in Section 7 were run on a server with a 3.6 GHz CPU and 12 GB of RAM.) In SlicStan, models such as HMMs are expressed by unrolling recursion into a fixed number of iterations, so it is expected that compilation time increases as the size of the inference problem increases. Figure 8 confirms this behavior and further shows that SlicStan struggles with large problem sizes: compiling the model conditioned on a sequence of length 60 takes over 30 minutes. In contrast, because Mappl can express the HMM as a recursive program, its compilation time is constant with respect to the problem size.
We also report the time Mappl takes to compile a version of the HMM with recursion unrolled. Figure 8 (Mappl*) suggests that the Mappl compiler exhibits better scalability than SlicStan on the same unrolled model. A probable reason for this speedup is that Mappl uses a simple two-level lattice in the information-flow analysis, whereas SlicStan uses a meet semilattice, which, as discussed by Gorinova et al. [29], hinders efficient constraint solving.

Scalability of exact inference: HMMs, PCFGs, and CRBD. We compare Mappl and PERPL [15] on recursive programs. PERPL represents a state-of-the-art approach to exact inference for recursive programs, compiling them to factor graph grammars [16] and then to systems of equations.
The benchmarks are the HMM in Figure 1a, a hierarchical HMM, a second-order HMM, the PCFG in Figure 1c, a PCFG with 6 nonterminals and 12 productions, and a discrete-time phylogenetic model. The phylogenetic model generates phylogenetic trees under the constant-rate birth-death (CRBD) assumption. The CRBD model is similar to the PCFGs in that it uses recursion (as opposed to iteration) and exhibits stochastic control flow [58].
Figure 9 shows how the inference running time scales as the size of the inference problem increases. Compilation time is not measured, as it does not vary with the problem size for either Mappl or PERPL. As PERPL uses a Python back end, to allow a fair comparison, compiled Mappl programs are further compiled to Python. We use enumeration-based exact inference implemented in Pyro [10] as an additional baseline on some benchmarks; it leads to exponentially increasing running time and quickly runs out of memory on all benchmarks.
On the two HMMs, PERPL's running time is superlinear in the problem size, whereas Mappl recovers the linear-time forward algorithm for HMMs. For an observed sequence of length 30, PERPL inference takes over 1 minute, while Mappl inference takes 1.5 seconds. On the two PCFGs and the CRBD model, PERPL also scales less favorably than Mappl. We note that PERPL supports unbounded recursion and thus allows the PCFG and CRBD models to be specified in a more declarative way. For example, the CRBD model in PERPL, though complicated by PERPL's linearity restriction, uses an almost surely terminating function to generate the waiting time until the next speciation or extinction event, whereas the Mappl version uses a geometric distribution truncated at the remaining time steps to ensure termination.
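For reference, the linear-time forward algorithm that the compiled code matches is the classic iterative dynamic program below (a standard textbook formulation in log space; the parameters in the usage example are illustrative):

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_evidence(log_init, log_trans, log_emit, obs):
    """Classic forward algorithm: log p(obs), O(len(obs) * S^2) for S states.
    log_init[z], log_trans[z][z2], log_emit[z][x] are log probabilities."""
    states = range(len(log_init))
    alpha = [log_init[z] + log_emit[z][obs[0]] for z in states]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[z] + log_trans[z][z2] for z in states])
                 + log_emit[z2][x] for z2 in states]
    return logsumexp(alpha)
```

The point of the experiment is that Mappl's compiler arrives at this asymptotic behavior automatically, from a recursive model description, rather than requiring the programmer to write the dynamic program by hand.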
We also assess, with the PCFGs, the performance implications of the information-flow analysis. Specifically, we evaluate the performance of a version of Mappl with command factorization disabled (Mappl#). Disabling factorization means that the VE compilation has to assume correlation between the subparses of a nonterminal, thereby hindering the discovery of recurring substructure amenable to dynamic programming. Figures 9d and 9e confirm that without factorization, VE-based inference is intractable.
Scalability of exact inference: Dice benchmarks. We compare Mappl and Dice [31]. Dice is a state-of-the-art approach to exact inference for discrete, nonrecursive programs. It casts inference as weighted model counting (WMC) on binary decision diagrams (BDDs), exploiting independence structure in programs to create compact BDDs for factorized inference. We use benchmarks [31, Fig. 10] on which Dice has been shown to demonstrate superior scalability over other PPLs that support exact inference. As Dice uses a C library for WMC on BDDs, to allow a fair comparison, compiled Mappl programs are further compiled to Rust. This Rust back end of Mappl is not yet full-featured but is sufficient for these Dice benchmarks. Figure 10 shows how the running time scales as the problem size increases. Given that the Dice running time reported by Holtzen et al. [31] includes the time required for BDD generation, the Mappl running time reported here includes that for VE compilation. Mappl is competitive with Dice on these scaling benchmarks, in fact outperforming Dice in three out of four cases. A possible explanation is that since Mappl can express these benchmarks as recursive programs, compilation time does not increase with the problem size.
We also run Dice on a PCFG. Dice does not support recursion, so we follow the recipe of Chiang et al. [15, App. E] in expressing a PCFG in Dice by manually unfolding a loop that generates a parse from subparses. Figure 9d suggests that PCFGs in this encoding are intractable for Dice.
These benchmarks all contain conditional-independence structure arising from function abstractions. While Mappl may do better on such programs, we note that Dice performs better on large Bayesian networks (BNs). For example, for a BN with ∼40 nodes, Dice solves the inference problem within 30 ms, while Mappl takes over 1 s. We conjecture that this is due to the known result that WMC can significantly outperform VE when models contain substantial local structure [14].

Approximate inference: Hamiltonian Monte Carlo (HMC). HMC [23] is a powerful sampling method for differentiable models. Discrete latent variables introduce nondifferentiability, posing challenges to applying HMC to hybrid discrete-continuous models. We consider two such models: a soft k-means clustering model and a latent Dirichlet allocation (LDA) model. One way to handle them is by marginalizing out the discrete variables using VE. For example, Pyro's support for HMC can handle these models by performing VE on plated factor graphs [48]. Pyro performs VE at run time, so we examine whether the performance of Pyro's HMC can be improved by ahead-of-time VE compilation through Mappl. Specifically, we use NumPyro [54], which supports fast HMC inference on top of JAX [24]. We run NumPyro's HMC on the original model and on the Mappl-compiled model with necessary syntax adjustments applied (including replacing the top-level logML with an invocation of NumPyro's HMC).
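The marginalization that makes such models HMC-friendly is simple in the clustering case: summing the cluster indicator out of the joint leaves a log-density that is smooth in the continuous parameters. A minimal sketch of that marginalized density (our own illustration, not Pyro or NumPyro code):

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_normal_pdf(v, mu, sigma):
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def marginal_loglik(x, log_weights, mus, sigma):
    """log p(x) with the discrete cluster indicator z eliminated:
    log p(x) = logsumexp_k (log w_k + log N(x; mu_k, sigma)).
    The result is differentiable in mus and sigma, so HMC applies."""
    return logsumexp([lw + log_normal_pdf(x, mu, sigma)
                      for lw, mu in zip(log_weights, mus)])
```

Performing this elimination ahead of time, rather than on every HMC leapfrog step, is precisely the saving the experiment measures.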
Figure 11 displays the running time of sampling a single chain consisting of 10,000 samples and 2,500 burn-in samples using the No-U-Turn sampler [32], while varying the number of discrete latent variables in the models. Time is measured after the JIT compiler is warmed up. As expected, ahead-of-time VE compilation leads to improved run-time performance.
Approximate inference: marginal-likelihood estimation. We examine the performance implications of Mappl's VE compilation for marginal-likelihood (ML) estimation, a key task in Bayesian learning and inference. We use a family of hybrid discrete-continuous HMMs (hmm′ in Figure 1e) as benchmarks. For the VE-compiled programs, we use importance sampling (IS) with Pyro to solve the inference subproblems (i.e., nested integrals). As a baseline, we use annealed importance sampling (AIS) [47] to solve the global inference problems directly. AIS, a generalization of IS, is a widely used sampling method for ML estimation. We assess the convergence rate of the ML estimate as the allowable running time increases; running time roughly translates to the number of importance samples. We experiment with multiple hyperparameter settings for AIS.
Figures 12a and 12b show the results for HMMs with input data of length 32 and 64, respectively. Mappl/IS converges within tens of seconds in both cases. In contrast, AIS either takes thousands of seconds to converge or shows no sign of convergence even after thousands of seconds, for each hyperparameter setting tested. The quick convergence of Mappl/IS is a consequence of the VE compilation eliminating discrete r.v.s and generating single-dimensional subproblems easily solvable by a Monte Carlo method.
In fact, the compiled program reveals that the inference problems of this experiment have exact solutions: we are able to use Mathematica to obtain closed-form solutions to the generated subproblem integrals. The dotted lines in Figures 12a and 12b represent the exact solutions to the global problems, as computed from those to the subproblems. Our approach thus enables potentially harnessing the power of symbolic-integration engines, by translating a recursive program into subproblems that have readily available closed-form solutions.
We also consider a similar HMM whose subproblems in the VE-compiled program are not known to have closed-form solutions, though they can be approximated with arbitrary precision using numerical methods. Figure 12c shows that Mappl/IS quickly converges to this approximate solution (the dotted line), whereas AIS fails to converge even for the relatively small problem size of 16.
VE does not always lead to improved performance, however. As another benchmark, we consider the aforementioned CRBD model extended with two continuous latent variables for the birth and death rates.

RELATED WORK
Here we focus the discussion on PPLs on the more expressive end of the spectrum. Early versions of IBAL support VE for programs with unbounded recursion and use lazy evaluation [51]. This approach allows models such as PCFGs to be specified more declaratively, but it seems to have been abandoned, for correctness and efficiency concerns [50], in favor of bounded recursion in later versions of IBAL [52] and in Figaro [53]. PERPL [15] supports exact inference for programs with unbounded recursion by compiling them to monotone systems of polynomial equations. Infinite data types, such as integers and strings, pose challenges to equation solving, as they would lead to infinite systems of equations. In response, PERPL uses whole-program transformations (de- and refunctionalization) to eliminate infinite data types. These transformations further necessitate a linear type system to ensure correctness.
Mappl shares the restriction of bounded recursion with a few other PPLs that support VE on PCFG-like models. Bounded recursion, while unable to express PCFGs as almost surely terminating programs, is expressive enough for Bayesian-inference queries on these models (see Figure 1c and another encoding given in an appendix), as the observed data is finite. Koller et al. [34] call such queries evidence-finite. We consider our choice to restrict attention to bounded recursion a sweet spot in the design space: it aligns well with the evidence-finite nature of many Bayesian inference problems, does not require the programmer to reason about linearity, leads to provably correct VE-compiled code with performance matching the best known PTIME algorithms, and still allows reasonably concise, readable programs.
SlicStan [29] supports VE for an imperative PPL in which programs have deterministic support and variables are global. It supports loops but not recursion, and HMMs expressed via loops do not seem to type-check in SlicStan's information-flow type system. Mappl, in contrast, is a functional PPL with a wider range of features. Its denotational treatment is compositional by nature: it enables a noninterference result on open terms, crucial for eliminating variables in subterms under binders. As Section 7 shows, Mappl's support for recursion, as well as its use of a two-level lattice rather than a meet semilattice, avoids the limitation of the SlicStan compiler in scaling to large models.
Solving probabilistic inference problems analytically. Exact, analytical solutions to inference problems are welcome whenever they can be computed efficiently. Some PPLs support exact inference for nonrecursive programs with no continuous variables, or with very restricted forms of them, by compiling the programs into finite graph representations for efficient inference [12,17,31,59]. FSPN [63] and PERPL [15] support exact inference for recursive programs, though they are known to work only for programs with discrete variables. Hakaru [46,65] and Psi [25,26] enable exact inference for programs with continuous variables (though still omitting recursion) using computer-algebra solvers.
Delayed sampling in Birch [43] and ProbZelus [8,3] allows partial analytical solutions to subprograms by exploiting conjugacy. The similarity to our approach is that both are forms of automatic Rao-Blackwellization [57, §4.2] that analytically reduce an inference problem to a better-behaved one. The distinctions are that delayed sampling derives closed-form posteriors for conjugate priors whereas our approach compiles away discrete variables; that delayed sampling is an inference-time approach based on dynamic dependence graphs whereas ours is a compile-time transformation; and that delayed sampling is not known to work with recursion.
In practice, no single inference technology is likely to excel at all problems; our approach and existing inference methods are complementary. Identifying independence is generally useful for compile-time Rao-Blackwellization; for example, our information-flow analysis could potentially be used in gradient-based methods to reduce the variance of gradient estimators. Conversely, our approach can potentially capitalize on advances in symbolic integration to solve generated subproblems analytically.
Reasoning about independence in probabilistic programs. Verifying randomized algorithms may require reasoning about independence, for which program logics have been developed [7,6,40]. While these program logics enable calculational, largely manual proofs of functional correctness, an information-flow type analysis is more amenable to automation through type inference. Hur et al. [33] and Amtoft and Banerjee [2] study program slicing for probabilistic while-languages. Their reasoning is concerned with determining whether two variables are correlated, conditioned on the observe statements in a program. Conditional independence can sometimes be determined syntactically for Bayesian networks through the notion of active trails [49]. The idea has been adapted to probabilistic programs [33,37], though a full soundness result is lacking.
Semantics of probabilistic programs. As a Cartesian-closed alternative to measurable spaces, quasi-Borel spaces (QBSes) were introduced to handle higher-order types [30,62]. It has further been shown that QBSes can be equipped with compatible ω-cpo structures to handle term-level recursion and recursive types [64]. Mappl's denotational semantics uses these constructions. Another way to give semantics to PPLs is to first define a deterministic operational semantics indexed by a randomness source and then integrate over randomness to obtain a measure semantics [11]. Prior work constructs logical relations for program equivalence in this operational setting [18,66,69], whereas we construct logical relations for noninterference in a denotational setting.

CONCLUSION
Our approach to variable elimination and marginal inference, presented in the context of Mappl, represents a generalization and synthesis of several important ideas.
• A compiler eliminates probabilistic effects, generalizing variable elimination from graphical models to a richly expressive PPL with recursion.
• It decomposes a global inference problem into subproblems, recovering and generalizing widely used dynamic-programming algorithms for recursive models.
• It factorizes computations into independent partitions, repurposing information-flow typing for probabilistic programs.
• Its correctness result relies on a logical-relations argument, adapting semantic models for noninterference to a measure-theoretic setting.

The payoff is that Mappl allows useful recursive models to be expressed in a functional, recursive style, while enabling sound, scalable inference for a broad class of these programs. Future work could explore ways to enable programmer control over the decomposition into inference subproblems and to exploit local structure in certain models to further speed up inference.

Figure 2 defines the syntax of Mappl programs. Local variables and global variables are notated in blue. An overline denotes a sequence of zero or more elements.

Figure 2. Syntax of Mappl and selected typing rules.

Figure 3. Selected definitions of the denotational semantics for Mappl.

Figure 4. Syntax of information-flow types and selected rules of the information-flow type system.

Figure 6. Variable-elimination transformation. Typing contexts are omitted for brevity (except in factorization judgments). A version of the transformation with complete context information is given in an appendix.

Figure 7. Compiling the hmm function in Figure 1a to that in Figure 1b. The calculation largely follows the rules in Figure 6. To simplify the presentation, we use the equality C⟦factor(e1); ...; factor(en)⟧ = e1 + ... + en and standard λ-calculus conversions without detailing the intermediate steps.

Figure 8. Compilation-time scaling as the size of the inference problem increases.

Figure 9. Scaling plots comparing exact-inference methods on recursive programs.

Figure 12. Performance of ML estimation on a family of hybrid discrete-continuous HMM models (Figure 1e), measured by how the negative log-ML estimate changes as allowable inference time increases.
• If ℓ ̸⊑ O, the observer is not classified at a high enough level to differentiate between two measures at ℓ, so M⟦σ^ℓ⟧_O is the full relation Meas ⌊σ⌋ × Meas ⌊σ⌋.
• Indistinguishability is subtler to define for the case ℓ ⊑ O. Here, we consider two measures indistinguishable if they agree on related measurable sets. The relation E⟦σ⟧_O defines the notion of relatedness for measurable sets: two measurable sets A1 and A2 are related when they are closed to one another under the value relation V⟦σ⟧_O, that is, a1 ∈ A1 ⇔ a2 ∈ A2 for all (a1, a2) ∈ V⟦σ⟧_O.

The relations V⊥⟦σ⟧_O, V⊥⟦τ⟧_O, E⊥⟦σ⟧_O, and M⊥⟦τ⟧_O then lift V⟦σ⟧_O, V⟦τ⟧_O, E⟦σ⟧_O, and M⟦τ⟧_O to account for partiality.

Proc. ACM Program. Lang., Vol. 8, No. PLDI, Article 218. Publication date: June 2024.