Distributions for Compositionally Differentiating Parametric Discontinuities

Computations in physical simulation, computer graphics, and probabilistic inference often require the differentiation of discontinuous processes due to contact, occlusion, and changes at a point in time. Popular differentiable programming languages, such as PyTorch and JAX, ignore discontinuities during differentiation. This is incorrect for parametric discontinuities—conditionals containing at least one real-valued parameter and at least one variable of integration. We introduce Potto, the first differentiable first-order programming language to soundly differentiate parametric discontinuities. We present a denotational semantics for programs and program derivatives and show the two accord. We describe the implementation of Potto, which enables separate compilation of programs. Our prototype implementation overcomes previous compile-time bottlenecks, achieving an 88.1x and 441.2x speedup in compile time and a 2.5x and 7.9x speedup in runtime, respectively, on two increasingly large image stylization benchmarks. We showcase Potto by implementing a prototype differentiable renderer with separately compiled shaders.


INTRODUCTION
Automatic differentiation (AD) is the automated computation of the derivative of a function given just the definition of the function itself. AD has been applied in many domains such as computer graphics and vision [Li et al. 2018], robotics [Bangaru et al. 2021; Hu et al. 2020], and probabilistic inference [Lee et al. 2018] for optimization and uncertainty quantification. Many of these computations are conceived as automatically differentiating continuous functions; however, discontinuous functions also arise naturally.
Discontinuities arise in computer graphics due to object boundaries, occlusion, and sharp changes of color. In robotics and physical simulation, computations can model contact, which causes a discontinuous change in the velocity of an object. In probabilistic inference, the model can have discontinuities. For example, a time-series modeling problem can have a discontinuous change in behavior at a point in time. Figure 1a illustrates an example of a discontinuity. A discontinuity in a function is a point at which the left and right limits approach different values. In many applications, the occurrence of an integral can introduce a parametric discontinuity. A parametric discontinuity is a discontinuity specified by a condition whose value depends on at least one parameter and at least one real-valued variable of integration. The shaded region depicts f(t) = ∫₀¹ [x ≤ t] dx, where the Iverson bracket [P] is 1 if the proposition P holds and is 0 otherwise. Since there is a variable of integration x and a parameter t, the discontinuity [x ≤ t] is a parametric discontinuity.
State of the Art. Popular AD tools ignore discontinuities during differentiation [Abadi et al. 2015; Bradbury et al. 2018; Paszke et al. 2019]. Ignoring discontinuities that are not parametric discontinuities is correct almost everywhere [Lee et al. 2020]. However, ignoring parametric discontinuities produces incorrect results. In optimization, this leads to slower convergence or even divergence [Bangaru et al. 2021; Lee et al. 2018; Li et al. 2018].
In standard AD systems, when applications include integrals, the typical strategy is to discretize the integral by evaluating the integrand at samples and summing the result. Standard AD then differentiates this discretized program. Figure 1b shows the discretization of the integral in Figure 1a, and Figure 1c shows that standard AD returns 0 because the derivative of each sample is 0 [Abadi et al. 2015; Bradbury et al. 2018; Paszke et al. 2019]. This is incorrect.
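This failure mode can be reproduced in a few lines of plain Python (the helper names below are ours, not from any AD framework): a Monte Carlo estimate of f(t) = ∫₀¹ [x ≤ t] dx recovers f(t) = t accurately, yet differentiating each sample while ignoring the discontinuity yields 0 for every sample, so the estimated derivative is 0 instead of 1.

```python
import random

def f_estimate(t, n=200_000, seed=0):
    """Monte Carlo estimate of f(t) = integral over [0, 1] of [x <= t] dx."""
    rng = random.Random(seed)
    return sum(1.0 if rng.random() <= t else 0.0 for _ in range(n)) / n

def naive_sample_derivative(t, x):
    """Per-sample derivative of [x <= t] with respect to t, ignoring the
    discontinuity: the Iverson bracket is piecewise constant in t away from
    x = t, so this is always 0."""
    return 0.0

# f(t) = t on [0, 1], so the true derivative is 1, but averaging the
# per-sample derivatives gives 0.
```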
Potto. Figure 1d depicts that the correct derivative of the integrand is instead the Dirac delta distribution δ(x − t), which integrates to 1 if t lies in the domain of integration. We present a differentiable programming language, Potto, that uses distributions to differentiate integrals with parametric discontinuities (Teodorescu et al. [2013] Section 1.3.7). Distributions are a generalization of functions that can represent the derivatives of a discontinuous function. In particular, derivatives of Potto programs denote distributions. We extend the sampling approach in standard AD to additionally sample at parametric discontinuities, where derivatives are non-zero, resulting in a correct estimate of the derivative. As a result, Potto supports compositional evaluation and therefore separate compilation, which was not possible in prior work, Teg [Bangaru et al. 2021].
We provide an example of using Potto for probabilistic inference (risk minimization) and implement both a 2D and a 3D differentiable renderer. Potto significantly improves compile time; it is slower in runtime for smaller programs and faster on larger programs (with more discontinuities and more complex discontinuities, i.e., a larger expression in the condition) when compared to Teg [Bangaru et al. 2021]. We find an 88.1x and 441.2x speedup in compile time and a 2.5x and 7.9x speedup in runtime, respectively, on two increasingly large image stylization benchmarks.
In this paper, we present the following contributions:
• We introduce Potto, a language for distribution programming, that is the first to differentiate parametric discontinuities while supporting compositional evaluation (Section 2). We provide a mathematical introduction to distribution theory (Section 3).
• We define the syntax of a core language, Langur (Section 4). A Langur term is an integrand and a Langur program is the integral of a term.
• We provide a type system defining well-formed terms (Section 5). We present the denotational semantics of Langur terms (Section 6) and their derivatives (Section 7). We provide a type system and denotational semantics for programs and their derivatives (Section 8). We prove that the derivative of the denotation of a program is equivalent to the (distributional) derivative denotation of that program.
• We prove that the operational semantics of Langur supports compositional evaluation and therefore separate compilation (Section 9).
• We implement a 3D renderer in the surface language, Potto, and use separate compilation to efficiently swap among shaders. We compare Potto with Teg [Bangaru et al. 2021] on differentiable rendering tasks and find that it significantly improves compilation times, a bottleneck in Teg, and that it is slower in runtime on small programs but faster on larger programs (Section 10).
With Potto, we expand the scope of differentiable programming languages to account for parametric discontinuities. Potto is a first-order language that supports separate compilation, leading to better performance in workflows involving many small changes to a larger program. We envision that our theoretical development and programming language design will lead to more expressive differentiable programming languages that better serve application domains such as computer graphics, robotics, and probabilistic inference.

EXAMPLE: RISK MINIMIZATION
Risk minimization is a fundamental problem in machine learning (Chapter 1.2 of Vapnik [1998]). We present a pedagogical example where the goal is to find parameters θ that minimize the risk
R(h_θ) = ∫_{-10}^{10} ℓ(h_θ(x), g(x)) dx,    (1)
where the squared error loss is ℓ(h_θ(x), g(x)) = (h_θ(x) − g(x))², the bounds of integration are −10 and 10, the parameters are θ = [a, b, μ], h_θ(x) = N(x; μ, 5)[a ≤ x ≤ b] is the (unnormalized) truncated normal density, and g(x) = N(x; 2, 5) is the normal density. Recall that a parametric discontinuity is a conditional containing one or more real-valued variables (of integration) and parameters in the condition. The two parametric discontinuities in R(h_θ) arise due to a ≤ x and x ≤ b because the parameters a and b are compared to the variable x.
The task is to automatically identify the optimal truncation points a, b and mean μ. At the minimal risk, R(h_θ) will have parameters a ↦ −10, b ↦ 10, and μ ↦ 2.

Optimization using AD.
A standard approach to solving arg min_θ R(h_θ) is to use gradient descent, which requires taking the gradient of R with respect to θ and then updating θ in the direction of descent according to the gradient. Modern AD systems enable one to write R as a program and rely on the system to automatically generate the gradient of R.

Differentiating Parametric Discontinuities
We present the problem of differentiating parametric discontinuities in the context of risk minimization and compare standard AD techniques with Potto.
Standard AD. Figure 2a depicts the (unnormalized) truncated normal and normal densities. Figure 2b shows the descent direction (black arrow) at initialization for standard AD (we mark derivatives computed by standard AD with a distinguishing notation). The descent direction for the mean μ is correct, but the derivative with respect to the truncation points a and b is 0 and therefore incorrect. Figure 2c illustrates that gradient descent can optimize the mean, but cannot optimize the truncation points.
Figure 2d depicts the descent direction computed by Potto. Potto correctly computes the derivative for the truncation points by sampling at a and b (orange stars) to account for the parametric discontinuities. At initialization, Potto computes the same descent direction for μ as standard AD. Figure 2e shows that the optimization is successful, approaching the desired curve by moving the mean right and widening the truncation points.
A standard AD algorithm as implemented in common AD frameworks [Abadi et al. 2015; Paszke et al. 2019] computes the gradient of the risk by applying the assumption that the derivative of the integral equals the integral of the derivative. However, the derivative and integral typically do not commute in the presence of discontinuities, and therefore, theoretically, the meaning of the resulting expression is not well-defined. Figure 2c illustrates that, when used within a gradient-based optimization algorithm, the resulting derivative does not result in the algorithm materially optimizing the parameters.
To show what goes wrong, we continue the derivation. In the next step of differentiating the risk, we follow the power rule and chain rule to produce:
∂/∂a R(h_θ) = ∫_{-10}^{10} 2 (h_θ(x) − g(x)) ∂/∂a (h_θ(x) − g(x)) dx.
By linearity of the derivative and since g does not depend on a, we have that ∂/∂a (h_θ(x) − g(x)) = ∂/∂a h_θ(x). Standard AD ignores the parametric discontinuities in h_θ and makes the following step:
∂/∂a h_θ(x) = [a ≤ x ≤ b] ∂/∂a N(x; μ, 5) = 0.
On its own, the derivative above is correct at every point except x = a and x = b, where the (standard AD) derivative of the program is zero but the mathematical derivative is undefined. However, because the conditional is integrated over, the derivative is incorrect everywhere, not just at two points.
Automatic Differentiation in Potto. Figure 2e shows the result of applying Potto to calculate the gradient of the risk to be used in gradient descent. During the optimization, the truncation points a and b widen and the mean of the truncated normal density shifts toward the normal density.
Using distribution theory [Teodorescu et al. 2013], we show that the derivative of the integral is the integral of the parametric distributional derivative of the integrand lifted to a distribution, as long as it satisfies a mild integrability condition and a transversality condition. Informally, these conditions ensure the integral is well-defined and that the discontinuities do not coincide with each other, because in general the product of distributions is not well-defined [Schwartz 1954]. For example, we cannot have a product of two indicator functions [x < a][x < b] where the discontinuities are equal (a = b). Returning to our running example, the following equality holds by definition (Definition 3.8):
∂/∂a ∫_{-10}^{10} h_θ(x) dx = ∫_{-10}^{10} D_a ⌈h_θ⌉(x) dx,
where the ⌈·⌉ operator lifts h_θ to a distribution, which therefore has a parametric distributional derivative.
In the case of risk minimization, the squaring in the loss produces [a ≤ x ≤ b]², which results in the parametric discontinuities coinciding and therefore not satisfying the transversality condition. However, we can rewrite the integrand so that there are no products of coincident discontinuities. Specifically, we expand the square and use the identity [a ≤ x ≤ b]² = [a ≤ x ≤ b]. Now that the integrand is written to satisfy the transversality condition, we can apply the product rule (Lemma 7.1).
We simplify the integrand to
N(x; μ, 5)² [a ≤ x ≤ b] − 2 N(x; μ, 5) [a ≤ x ≤ b] g(x) + g(x)²    (2)
and then lift it to a distribution. Since the transversality condition is satisfied, the rules for the parametric distributional derivatives match the rules of calculus, except for the derivative of discontinuous functions. We show this new derivative rule in detail. The parametric distributional derivative of ⌈h_θ⌉(x) is:
D⌈h_θ⌉(x) = ⌈[a ≤ x ≤ b] ∂/∂μ N(x; μ, 5) μ′⌉ + (−δ(x − a) a′ + δ(x − b) b′) N(x; μ, 5),
where term (1) is the lifted piecewise term, term (2) is −δ(x − a) a′, term (3) is δ(x − b) b′, and term (4) is N(x; μ, 5). The first term (1) is the same as in standard AD: the piecewise sum of the derivatives of the function's pieces. Introducing the three terms that follow makes the integral of the derivative correct. The product of terms (2) and (4) accounts for the parametric discontinuity at a and the product of terms (3) and (4) accounts for the parametric discontinuity at b. The parametric distributional derivative introduces a Dirac delta distribution, δ(·), for each differentiated parametric discontinuity.
The arrows in Figure 2a depict the directions of the derivative of the risk with respect to the parameters a (terms 2 and 4), b (terms 3 and 4), and μ (term 1) that Potto computes. We have that the derivative ∂(x − a)/∂a = −1. As a result, ∂/∂a h_θ(x) = −δ(x − a) N(x; μ, 5). Putting these results together, we have that:
∂/∂a ∫_{-10}^{10} h_θ(x) dx = ∫_{-10}^{10} −δ(x − a) N(x; μ, 5) dx = −N(a; μ, 5) if −10 < a < 10 else 0,    (3)
where the last equality follows from the sifting property (Teodorescu et al. [2013] Equation 1.47):
∫ δ(x − a) φ(x) dx = φ(a) when a lies in the domain of integration.    (4)
We conclude that an infinitesimal perturbation to the parameter a shifts it to the right, decreasing the area under the truncated normal by N(a; μ, 5).
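As a numerical sanity check on this conclusion (plain Python; the helper names are ours), the sketch below compares a central finite difference of the truncated normal's area with respect to a against the sifting-property result −N(a; μ, 5):

```python
import math

def normal(x, mu, sigma=5.0):
    """The normal density N(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area(a, b, mu, n=50_001):
    """Trapezoid estimate of the integral of N(x; mu, 5) [a <= x <= b] over [-10, 10]."""
    lo, hi = max(a, -10.0), min(b, 10.0)
    if lo >= hi:
        return 0.0
    h = (hi - lo) / (n - 1)
    ys = [normal(lo + i * h, mu) for i in range(n)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# d/da area(a, b, mu) should equal -N(a; mu, 5) by the sifting property.
a, b, mu, eps = -3.0, 4.0, 2.0, 1e-4
finite_difference = (area(a + eps, b, mu) - area(a - eps, b, mu)) / (2 * eps)
sifting_result = -normal(a, mu)
```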
Standard AD versus Potto Results. Standard AD and Potto use Monte Carlo integration to estimate the integral in the derivative of the risk. Both systems average the evaluation of the integrand at 50 uniformly random samples from [−10, 10]. Potto implements the sifting property by sampling and evaluating at the points of the parametric discontinuities.
Figure 3 depicts the changes in each parameter using standard AD. The derivative at the truncation points is zero for standard AD (blue), so the loss remains nearly constant because only mu is optimized. Potto (orange) more rapidly approaches the minimum and achieves a loss that is over three orders of magnitude lower than the loss from standard AD.

Separate Compilation
The Potto prototype implementation supports differentiation of parametric discontinuities and separate compilation. Previous work, Teg [Bangaru et al. 2021], supported the differentiation of parametric discontinuities, but not separate compilation.
Problem as a Program. In Potto, we can implement risk minimization as a set of source files: densities.po, risk.po, and main.po (listing omitted). The function defined in risk.po implements the risk function specified by Equation 1 with the integrand implemented as in Equation 2. Lines 4-5 define an integral representing the area under the curve defined by the integrand from x=-10 to 10. The first argument to the integral specifies the domain of integration, the second argument is the integrand, and the third argument dx declares the variable x. The let bindings define a probability density function, pdf, depending on a parameter mu, and the target density target as the normal density centered at 2. The main.po file does a single step of gradient descent for the parameters of the truncated normal density. Line 2 imports the derivative of the normal and truncated normal densities from densities.po.
The import_deriv statement specifies the functions to differentiate, and the as clause gives a name to the derivative.
Line 6 unpacks the parameters in params and declares new parameters. The one_hot(i, n) command returns a list of length n that is one at index i and zero otherwise. The decrement on Line 7 implements gradient descent with a step size of 400. For example, to differentiate with respect to the parameter a, we set the infinitesimals da, db, dmu to one_hot(0, 3), which is (1, 0, 0). Since drisk returns a pair of the evaluation of risk and its derivative, we extract the derivative by indexing.
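A minimal Python sketch of this pattern follows; the drisk calling convention below is an assumption mirroring the text, not Potto's actual API:

```python
def one_hot(i, n):
    """A list of length n that is 1.0 at index i and 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def gradient_step(params, drisk, step):
    """One gradient-descent step. drisk(params, infinitesimals) is assumed to
    return a pair (risk value, directional derivative), as in main.po."""
    grads = [drisk(params, one_hot(i, len(params)))[1] for i in range(len(params))]
    return [p - step * g for p, g in zip(params, grads)]
```

For instance, with a toy risk R(θ) = Σ θ_i², whose directional derivative along the i-th one-hot infinitesimal is 2θ_i, a step of 0.5 sends every parameter to zero.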
Teg. Teg [Bangaru et al. 2021] is implemented as a series of rewrites that are applied exhaustively to a given program. For consistency, we represent Teg programs with a syntax similar to the one we used for Potto programs.
For example, a Teg program that arises in the computation of risk minimization matches up with the left-hand side of Equation 3 and denotes that expression; the delta syntax denotes the Dirac delta distribution. To evaluate this expression, Teg applies a rewrite rule that eliminates Dirac deltas by applying the sifting property (Equation 4), producing the program
-normal(a, mu) if -10 < a < 10 else 0,
which matches the right-hand side of Equation 3. Teg evaluates the integral-free and delta-free program to a number, in the same way any other compiler would.
Potto. Potto collects samples from the integral and the derivative of the discontinuities in the integrand (i.e., the Dirac deltas) and then evaluates the program at a given sample for each file (densities.po and risk.po).
In the example above, Potto samples the point x=a. Potto makes no changes to the code and evaluates the integrand directly at x=a as follows. First, Potto accepts the sample if it is inside the range [−10, 10] and rejects the sample otherwise. If accepted, Potto then evaluates the delta at a and returns one because a - a equals zero (otherwise it returns zero). Potto multiplies one by the normal density evaluated at a, producing the same result as Teg.
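The acceptance-and-sift evaluation can be sketched in a few lines of Python (an illustration of the steps above, not Potto's implementation):

```python
import math

def normal(x, mu, sigma=5.0):
    """The normal density N(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def eval_at_delta_sample(a, mu, lo=-10.0, hi=10.0):
    """Evaluate delta(x - a) * normal(x, mu) at the sample x = a."""
    x = a                                       # the sample is placed at the discontinuity
    if not (lo <= x <= hi):                     # reject samples outside the domain
        return 0.0
    delta_weight = 1.0 if x - a == 0 else 0.0   # the delta fires exactly at x = a
    return delta_weight * normal(x, mu)
```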
Separate Compilation. In risk minimization, different approximating densities change the quality of results. Figure 4 shows how a user can replace the truncated normal density with a normal density and compose it (and its derivative) with the risk function (and its derivative). Ideally, the user should be able to interchange the approximating densities by replacing the density (and its derivative) rather than differentiating the risk function again.
Support for separate compilation achieves this by enabling the division of a program into distinct source files, differentiating each individually, and composing the derivatives together to form an executable. This process enables efficient code reuse. The challenge is to support separate compilation of derivatives for programs with parametric discontinuities.
Potto supports separate compilation by directly sampling discontinuities and evaluating the program at a given sample for each file. Because Teg relies on a rewrite rule to apply the sifting property, it requires that the code for the integral and integrand be within the same file, preventing separate compilation.
Results. We present timing results in our prototype implementation of Potto. Differentiating the risk separately from the normal density and truncated normal density takes 50% less time than differentiating the composition of the risk function with the normal density and with the truncated normal density (as averaged over three runs on an Intel i9-9980HK).
Techniques from Teg [Bangaru et al. 2021] can be used in addition to Potto as a static analysis that eliminates Dirac deltas when the code for an integral and integrand are within the same file. However, Potto significantly improves compile time and tends to be slower in runtime on smaller programs and faster on larger programs (with more discontinuities and more complex discontinuities, i.e., a larger expression in the condition).
For risk minimization, we find that Potto (with separate compilation) compiles about 377x faster than Teg, and that the runtime of Potto is about 27x slower than Teg. On workloads in computer graphics (Section 10), we find that compilation time is a barrier to common workflows and that Potto has better runtime performance than Teg on larger programs (see Figure 15).


DISTRIBUTION THEORY
The derivative of a discontinuous function does not exist in calculus. Distributions are a generalization of functions that give meaning to derivatives of discontinuous functions.
We present an example of differentiating the Heaviside step function H(x) = [x ≥ 0], which has a jump discontinuity at x = 0. One way to approach this problem is to note that if a sequence of smooth functions f_n(x) converges to a smooth function f(x) as n goes to infinity, then the sequence of derivatives of the f_n(x) converges to the derivative of f(x) as n goes to infinity. We know that this reasoning will fail for H(x) because it is discontinuous, but we will see what happens anyway.
The sequence of smooth functions f_n(x) converges to H(x), but its sequence of derivatives concentrates into an ever-narrower, ever-taller spike; the arrow in the rightmost figure indicates an infinite spike at x = 0. This sequence of derivatives diverges, since there is no real-valued function that represents the limit of an infinite spike at x = 0. This is similar to how a convergent sequence of rational numbers may not converge to a rational number (e.g., the Leibniz formula for π is a sequence s_n = Σ_{k=0}^{n} (−1)^k / (2k + 1) that approaches π/4). However, every such convergent sequence approaches a real number. Similar to how real numbers generalize rational numbers, distributions generalize functions. In the case above, the sequence of derivatives f_n′(x) converges to the Dirac delta distribution δ(x).
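One standard choice for such a sequence (an assumption; the paper's particular sequence is not reproduced here) is the logistic sigmoid f_n(x) = 1/(1 + e^{−nx}). Its derivative has a peak of height n/4 at x = 0 that grows without bound, while the integral of the derivative over an interval straddling 0 approaches 1:

```python
import math

def f_n(x, n):
    """A smooth approximation of the Heaviside step: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-n * x))

def df_n(x, n):
    """Derivative of f_n: a spike whose peak df_n(0, n) = n / 4 grows with n."""
    s = f_n(x, n)
    return n * s * (1.0 - s)

# By the fundamental theorem of calculus, the integral of df_n over [-1, 1]
# is f_n(1) - f_n(-1), which approaches 1 even as the peak height diverges.
```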
The Dirac Delta Distribution. Informally, δ(x) is infinite at x = 0 and 0 everywhere else. Additionally, the integral over the real line of each f_n′(x) and of δ(x) is one. We can formalize the Dirac delta as a distribution using the sifting property: for every test function φ, ∫ δ(x) φ(x) dx = φ(0). A test function is a smooth function with compact support. For example, the constant function φ(x) = 0 and the bump function are test functions. On the other hand, φ(x) = [−1 < x < 1] is not smooth and φ(x) = 1 does not have compact support, so neither is a test function.
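The sifting property can be checked numerically (our illustration) by replacing δ with a narrow Gaussian, a standard "nascent delta": integrating it against the bump test function concentrates the integral at φ(0) = e^{−1}.

```python
import math

def nascent_delta(x, eps):
    """A Gaussian of width eps; it integrates to 1 and concentrates as eps -> 0."""
    return math.exp(-0.5 * (x / eps) ** 2) / (eps * math.sqrt(2 * math.pi))

def bump(x):
    """The bump test function: smooth, with compact support on (-1, 1)."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

def sift(eps, n=20_001):
    """Trapezoid estimate of the integral of nascent_delta * bump over [-1, 1]."""
    h = 2.0 / (n - 1)
    ys = [nascent_delta(-1.0 + i * h, eps) * bump(-1.0 + i * h) for i in range(n)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))
```

As eps shrinks, sift(eps) approaches bump(0) = e^{−1}, matching the sifting property.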
Distributions are designed to be a generalization of functions for which the regular rules of calculus, such as integration by parts, changes of variables, and derivatives, still hold.
A distribution f acts on test functions and (1) is continuous and (2) is linear: for every a, b ∈ R and test functions φ and ψ, f(aφ + bψ) = a f(φ) + b f(ψ). We define changes of variables in terms of diffeomorphisms. Figure 5 depicts an example of a diffeomorphism Ψ that transforms a 2D grid using the sigmoid function.
Definition 3.4. A C^k-diffeomorphism Ψ : R^n → R^n is a k-times differentiable, invertible function. △
Every C^k-diffeomorphism Ψ (with k ≥ 1) satisfies the change of variables formula ∫ f(Ψ(x)) |det DΨ(x)| dx = ∫ f(y) dy. The distributional derivative of a distribution f is the distribution f′ given by f′(φ) = −f(φ′). △ This definition is inspired by integration by parts, where the boundary term is 0 because φ has compact support.
We want to be able to lift a function to a distribution so that it can then be differentiated. Locally integrable functions can be lifted to distributions.
Definition 3.6. A function is locally integrable if it is integrable on every compact set of its domain. △
Every integrable function is locally integrable, but not every locally integrable function is integrable. For example, the constant function f(x) = 1 is locally integrable (its integral over every compact set is finite) but is not integrable over the real line. Every probability density function (PDF) defined over the reals (e.g., the uniform PDF and the normal PDF) is integrable because it integrates to 1 and is therefore locally integrable.
Definition 3.7. A regular distribution is a distribution that is equal to the (Lebesgue) integral over a locally integrable function. △
The Dirac delta is not locally integrable because it is not a real-valued function, and it is not a regular distribution because no locally integrable function satisfies the sifting property (Teodorescu et al. [2013] page 19).
Derivative of Heaviside is Delta. We now revisit the example of the Heaviside step function H(x) = [x ≥ 0]. It is not differentiable at x = 0, but it is locally integrable and can be lifted to a regular distribution with the distributional derivative H′(φ) = −H(φ′) = −∫ H(x) φ′(x) dx. Continuing, we apply the definition of the Heaviside step function, and then simplify the integral using the fact that φ has compact support: −∫ H(x) φ′(x) dx = −∫_0^∞ φ′(x) dx = φ(0). We can thus conclude that H′ = δ. (The fact that PDFs lift to distributions is important because it means probabilistic programs interpreted as integrators [Shan and Ramsey 2017] also define distributions. Note that while all Radon measures as integrators define distributions, not all distributions denote Radon measures.)
Our Desiderata. We need a notion of derivative that applies to discontinuous integrands. Furthermore, classical theory severely limits products of distributions [Schwartz 1954], preventing δ(x) H(x) from being a valid distribution. For example, the distributional derivative of a program with nested conditionals and coincident conditions such as (1 if x > a else 0) if x > a else 0 may not satisfy the product rule. (We expect that H(x − a) · H(x − a) = H(x − a); differentiating both sides gives 2 δ(x − a) H(x − a) = δ(x − a), which forces H(0) = 1/2, and repeating the argument for higher powers forces inconsistent values of H(0).) Our theory provides a sufficient condition for the product rule to hold (see Appendix B.1). For example, the derivative of (1 if x > a else 0) if x > a + 1 else 0 satisfies the product rule because the conditions (x > a + 1 and x > a) are distinct, as is the case for numerically stable programs (see Section 7.3.2).
Our Solution. We resolve these problems by formalizing parametric distributions and parametric distributional derivatives. A parametric distribution f : R^m → D′(R^n) is a family of distributions over free parameters p ∈ R^m.
Definition 3.8. The parametric distributional derivative of a parametric distribution f applies the derivative in the parameter p to the pairing with each test function φ: (D_p f)(φ) = ∂/∂p f(p)(φ), as long as the derivative of the integral on the right-hand side exists. △

SYNTAX
In this section, we introduce the syntax of the core language, Langur, for implementing Potto programs.
Langur is a first-order language. All terms are of base type real and Langur has first-order let-bindings. We describe how the surface language maps to the core language at the end of Section 9.
The grammar for Langur terms follows. We use three metavariables to range over terms and one metavariable to range over programs.

Terms
We now briefly describe the syntax of Langur terms.
Conditionals. Conditionals ifge0 c then t1 else t2 are such that the condition c must be variable-free (it can contain parameters) and the comparison c ≥ 0 is implicit (in OCaml, if c >= 0 then t1 else t2). For example, in the case study, there is one variable, x, and there are three free parameters a, b, and μ. The condition c must be free of variables, preventing parametric discontinuities.
Diffeomorphic Conditionals. Diffeomorphic conditionals, ifge0 ⌊Ψ⌋ then t1 else t2, have differentiable, invertible conditions, ⌊Ψ⌋, defined outside the core language. For example, the truncation points a ≤ x ≤ b in Line 6 of distributions.po are specified by nesting the diffeomorphic conditionals Ψ1(x, a) = x − a and Ψ2(x, b) = b − x, because if x − a ≥ 0 then a ≤ x and if b − x ≥ 0 then x ≤ b. The arguments to Ψ are all variables and free parameters (we make this choice to simplify the presentation and could easily modify the language so that the number of variables could range from 1 to n and the number of parameters from 1 to m).
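As a concrete reading of the nesting, the following Python sketch (our illustration of the intended semantics, not Langur syntax) evaluates the truncation indicator [a ≤ x ≤ b] by testing x − a ≥ 0 and then b − x ≥ 0:

```python
def trunc_indicator(x, a, b):
    """Nested ifge0-style conditionals encoding [a <= x <= b].

    Each condition, x - a and b - x, is differentiable and invertible in x,
    as diffeomorphic conditionals require."""
    if x - a >= 0:        # outer ifge0: a <= x
        if b - x >= 0:    # inner ifge0: x <= b
            return 1.0
        return 0.0
    return 0.0
```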

Programs
The term int t d(x1, . . ., xn) represents integration of a term t over R^n with respect to the variables x1, . . ., xn. Programs can be differentiated with respect to parameters p ∈ Params.

TYPE SYSTEM
Figure 6 presents a type system that characterizes when well-formed terms denote real functions.

Types and Type Contexts
Langur has real as its only type. Langur lacks arrow types because it is a first-order language. The typing judgments are expressed in terms of two type contexts: one for variables, Δ, and one for parameters, Γ. We distinguish contexts for variables so that we can easily state when a term does or does not contain variables. A type context is a mapping from variable names to types. The contexts Δ and Γ are disjoint so that variables and parameters do not overlap. We use the syntactic sugar x : real := x : real, • for simplicity.

Terms
The typing judgment Δ; Γ ⊢ t : real indicates that the term t is of type real given that each of the variables in Δ and parameters in Γ is of type real.
Arithmetic Primitives. The typing rules for arithmetic primitives operate over reals and are standard. For both sums and products, if the arguments are real then the result is real.
Conditionals. The typing rule for conditionals ifge0 c then t1 else t2 states that a conditional is of type real if the condition c is real and variable-free and if the terms t1 and t2 are real.
Diffeomorphic Conditionals. For variables to arise in conditions, we must use diffeomorphic conditionals ifge0 ⌊Ψ⌋ then t1 else t2. Using the isomorphism between variable names and ordered inputs, x⃗ ↦ Ψ(x⃗, p⃗) is a C1-diffeomorphism as in Definition 3.4.
The programmer defines a diffeomorphism in the surface language, Potto, as a pair of tuples, where the first tuple specifies Potto code to compute the diffeomorphism and the second tuple specifies its inverse. Future work could automate (piecewise) inversion, which is a long-standing and challenging problem studied in both the programming languages and graphics communities [Anderson et al. 2017; Lutz 1986; Matsuda and Wang 2020].
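A sketch of such a pair in Python (the names and calling convention are our assumptions, not Potto's actual surface syntax):

```python
def psi_fwd(x, a):
    """Forward map of the diffeomorphism: the condition psi(x, a) = x - a >= 0
    encodes a <= x."""
    return x - a

def psi_inv(y, a):
    """Programmer-supplied inverse in the variable x: recovers x from y = x - a."""
    return y + a
```

The requirement that Ψ be invertible in its variables is what makes the programmer-supplied inverse well-defined.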
Let Bindings. The let x = t1 in t2 primitive introduces a fresh name x into the typing context Γ. The term t1 must be free of variables. For example, the type system accepts let x = p1 + p2 in x + x if p1 : real, p2 : real ∈ Γ and rejects let x = y in x when y is a variable.

DENOTATIONAL SEMANTICS OF TERMS
The denotational semantics ⟦t⟧(γ, ρ) maps a well-typed term Δ; Γ ⊢ t : real to a simply decomposable function. Informally, the class of simply decomposable functions consists of piecewise-differentiable functions with finitely many piecewise-invertible discontinuities. Using the isomorphism between mappings from variable or parameter names to real numbers and ⟦Δ⟧ ≅ R^n and ⟦Γ⟧ ≅ R^m, where the sizes of the contexts are |Δ| = n and |Γ| = m, we can write ⟦t⟧ : R^n × R^m → R. The following semantics are standard and serve as scaffolding for the following section.

Types, Type Contexts, and Value Contexts
The only type in Langur is real, and it denotes real numbers: ⟦real⟧ = R. Type contexts Γ and Δ denote mappings to real numbers. The empty context denotes the empty function and a non-empty context denotes a disjoint union of functions. The value context γ : Vars → R maps variable names to real numbers. The value context ρ : Params → R maps parameters to real numbers. We say that γ ∈ ⟦Δ⟧ if, for every x in the domain of Δ, we have γ(x) ∈ ⟦Δ(x)⟧. In words, for every variable in the typing context, its value is an element of its type. We define ρ ∈ ⟦Γ⟧ analogously.

Terms
Now that we have defined the denotation of each type, we give a denotation to each term in Figure 7. The denotation of a term Δ; Γ ⊢ t : real is a function from the denotation of the free variables, ⟦Δ⟧ × ⟦Γ⟧, to the denotation of the type, ⟦real⟧. As a result, we provide inputs γ, ρ to the denotation of a term, where γ ∈ ⟦Δ⟧ and ρ ∈ ⟦Γ⟧.
We discuss only notable cases.
Conditionals. For conditionals ifge0 c then t1 else t2, variables are not allowed in the condition c, preventing parametric discontinuities.
Diffeomorphic Conditionals. For the conditional ifge0 ⌊Ψ⌋ then t1 else t2, the first output dimension of Ψ : ⟦Δ⟧ × ⟦Γ⟧ → R corresponds to the condition; the other dimensions serve to make Ψ a diffeomorphism (Definition 3.4). We require that the condition is part of a diffeomorphism in order to give meaning to the derivative.
Let Bindings. The denotation of a let binding let x = t1 in t2 is the denotation of t2 with the variable x bound to the denotation of t1.

Results
In this section, we prove results that serve primarily as scaffolding for results in the following sections. We show that the denotation of a term is a simply decomposable function.

Definition 6.1. A C^k simply decomposable function is a finite sum of finite products of simple discontinuities times a k-times differentiable function in x⃗. Concretely,
f(x⃗, p⃗) = Σ_{i ∈ I} f_i(x⃗, p⃗) Π_{j ∈ J_i} [Φ_{ij}(x⃗, p⃗) ≥ 0],
where I and all J_i are sets of finitely many indices, each f_i is differentiable in x⃗, and each [Φ_{ij}(x⃗, p⃗) ≥ 0] is a simple discontinuity. △
For simplicity, we call C1 simply decomposable functions just simply decomposable. For example, [sin(x) ≥ 0] is not simply decomposable because sin(x) is not invertible and cannot be decomposed into finitely many invertible pieces. Theorem 6.1. Every well-typed term Δ; Γ ⊢ t : real denotes a simply decomposable function ⟦t⟧. « Proof. We provide a full proof in Appendix H. □
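As a concrete instance (our illustration), the truncated normal from the running example is simply decomposable with one summand, two simple discontinuities, and a smooth factor:

```python
import math

def smooth_factor(x, mu, sigma=5.0):
    """The k-times differentiable factor: the (untruncated) normal density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def h(x, a, b, mu):
    """h in the shape of Definition 6.1: a product of the simple discontinuities
    [x - a >= 0] and [b - x >= 0] times a smooth factor."""
    ind_a = 1.0 if x - a >= 0 else 0.0
    ind_b = 1.0 if b - x >= 0 else 0.0
    return ind_a * ind_b * smooth_factor(x, mu)
```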

DENOTATIONAL SEMANTICS OF DERIVATIVES OF TERMS
A simply decomposable function can be lifted to a regular distribution as defined in Equation 8. While simply decomposable functions do not necessarily have a well-defined derivative, they do have a well-defined distributional derivative.

Types, Type Contexts, and Value Contexts
Types now denote distributions: ⟦real⟧ = D′(R). The denotation of the typing context for variables remains the same. The typing context for parameters, Γ, has new bindings for the infinitesimal perturbations: the parameter p is a real number and the infinitesimal perturbation p′ is a distribution. The meaning of the derivative of the empty context is still the empty context. In a nonempty context, there are twice as many parameters because each parameter has an associated infinitesimal. For instance, the total derivative of f(a, b) = ab is df(a, b, a′, b′) = b a′ + a b′ with infinitesimals a′, b′. We can then, for example, recover the partial derivative with respect to a by setting a′ = 1 and b′ = 0. The value contexts γ and ρ remain as before, but ρ′ is a mapping from parameters to distributions representing the values of the infinitesimal perturbations to the parameters.
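The doubling of the parameter context is the familiar forward-mode pattern; a minimal Python sketch (our illustration) of a product's total derivative, f(a, b) = a · b:

```python
def dmul(a, b, da, db):
    """Total derivative of f(a, b) = a * b with infinitesimals da and db:
    df = b * da + a * db."""
    return b * da + a * db

# Recover the partial derivative with respect to a by setting da = 1, db = 0.
```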

Terms
Figure 8 presents the denotational semantics for derivatives of terms as distributions ⟦de⟧(σ, ρ, ρ′) ∈ D′(Rⁿ), where σ ∈ ⟦Δ⟧; the real values for the parameters, ρ, and the distributions for the infinitesimal perturbations, ρ′, are such that (ρ, ρ′) ∈ ⟦dΓ⟧; and the number of variables is n = |Δ|. We write the parametric distributional derivative with respect to a parameter p as ∂ₚ.
Arithmetic Primitives. The derivative denotation of a sum is the sum of the derivative denotations. The denotation of the derivative of a product is the denotation of the derivative of the first term times the denotation of the second term lifted to a distribution, plus the derivative of the second term times the denotation of the first term lifted to a distribution.
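These sum and product rules have the same shape as forward-mode AD with dual numbers. The sketch below (ours, covering only the smooth fragment; discontinuous terms need the distributional machinery above) carries a value and an infinitesimal perturbation together:

```python
# Forward-mode dual numbers implementing the sum and product rules for the
# smooth fragment of the language; discontinuities require distributions.
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # parameter value
    dot: float   # infinitesimal perturbation

    def __add__(self, other):
        # d(u + v) = du + dv
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # d(u * v) = du * v + u * dv
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

# Total derivative of f(a, b) = a*b; recover df/da by seeding da=1, db=0
a, b = Dual(3.0, 1.0), Dual(4.0, 0.0)
```

Here (a * b).dot evaluates to 4.0, i.e., ∂(ab)/∂a = b, matching the product rule in Figure 8.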

Results
We now prove that the denotational semantics is sound: for every term, the derivative denotation of the term equals the derivative of the denotation of the term. To do so, we extend the theory.
Definition 7.1. A Cᵏ simple distribution is the lifting of a density f(x⃗, p⃗), where x⃗ ∈ Rⁿ, that is Cᵏ simply decomposable for all p⃗ ∈ Rᵐ. △

Proposition 7.1. For every well-typed term Δ; Γ ⊢ e : real, if ⟦e⟧ is locally integrable, then it can be lifted to a distribution that is a C¹ simple distribution. «

Proof. Since ⟦e⟧ is a locally integrable function, it can be lifted to a distribution. By Theorem 6.1, the density is simply decomposable, so the lifted distribution is a simple distribution. □

7.3.1 The Transversality Condition. In 1D, the derivative is not defined when there are discontinuities whose jumps coincide at the same point. For instance, the (parametric) distributional derivative of [x ≤ a][x ≤ a] violates the product rule due to a fundamental restriction of distribution theory [Schwartz 1954]. In higher dimensions, degeneracies only occur when the zero sets of the diffeomorphisms Ψ are not transverse. Informally, two submanifolds of Rⁿ are transverse when they are not tangent to each other at any point of their intersection. Such a point of tangency creates a degeneracy similar to the 1D case of colocated discontinuities.
Definition 7.2. A family of submanifolds M in Rⁿ is mutually transverse when, for any subfamily M′ ⊆ M and any point x in the intersection of the submanifolds in M′, the normal spaces at x are linearly independent. △

We specialize this definition to Rⁿ, and therefore use the normal space rather than the tangent space for simplicity. For example, two planes in R³ that only intersect along a line are transverse. If they do not intersect at all, then they are (vacuously) transverse. However, when the two planes coincide, i.e., they are the same plane, their intersection is not transverse, as the normals of the planes are equal. The following examples are not mutually transverse: two osculating circles, and three lines in R² that intersect at the same point.
The derivative of a diffeomorphic conditional introduces a Dirac delta located along the zero-level set of the diffeomorphism. Figure 9 presents such an example. The initial function (9a) has a zero-level set of the diffeomorphism that introduces a ridge representing the Dirac delta in the derivative (9b), which is not present when the delta is omitted (9c). For a product of diffeomorphic conditionals, if the corresponding zero-level sets are not transverse, then the denotation of the conditionals is degenerate. For example, [a ≤ x][a ≤ x] is equivalent to [a ≤ x] as a function, yet the product rule double-counts the jump: 2[a ≤ x]δ(x − a) versus δ(x − a).
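The double-counting is easy to observe numerically. The sketch below (our own illustration, using the Iverson-bracket notation above) compares a finite-difference estimate of d/da ∫₀¹ [x ≥ a]² dx, which is −1 for a ∈ (0, 1), against the −2 that a naive application of the product rule predicts:

```python
def F(a: float, n: int = 100_000) -> float:
    # Midpoint-rule estimate of the integral from 0 to 1 of [x >= a]^2 dx.
    # Since the bracket is 0 or 1, the square equals the bracket: F(a) = 1 - a.
    return sum(1.0 for i in range(n) if (i + 0.5) / n >= a) / n

a, h = 0.5, 1e-3
finite_diff = (F(a + h) - F(a - h)) / (2 * h)   # close to -1, the true derivative

# Naive product rule: da([x >= a][x >= a]) = 2[x >= a] * da[x >= a];
# its delta term integrates to -2, double-counting the single unit jump.
naive_product_rule = -2.0
```

The finite difference recovers the true slope of F(a) = 1 − a, while the degenerate product of colocated discontinuities predicts twice the jump.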

Justification for the Transversality Condition.
Terms can often be written in a way that satisfies the transversality condition. For example, in math, some rewrites are: replacing [x ≥ a]² with [x ≥ a], or replacing sin([x ≥ a]) with [x ≥ a] sin(1), prior to differentiation. Practical cases that cannot be rewritten this way typically pose problems beyond differentiability, such as numerical instability. For instance, the construction of a quadrilateral from two triangles violates the transversality condition because the two triangles have a coincident edge. Operations on the quadrilateral, such as rotation and translation, lead to numerical errors because edges that are supposed to be coincident are instead distinct. Such numerical instabilities in geometry cause problems such as light leaks, where light from behind a mesh shines through an object due to a crack, and missed collisions, where an object fails to collide with another object and passes through it instead [Woop et al. 2013].
So the violation of transversality poses practical challenges beyond that of differentiability. As a result, graphics engines generally encode geometry as a partition of space (each edge is only represented once). Hence, although practical computations can be represented as programs that violate transversality, it is often desirable to implement the computation in an alternative way that restores transversality [Dalstein et al. 2014].
If the transversality condition fails, the program may not have a well-defined distributional derivative (e.g., ifge0 ⌊Ψ⌋ then (ifge0 ⌊Ψ⌋ then 1 else 0) else 0). In this case, the interpreter will return results that we do not assign a meaning to. Similarly, PyTorch [Paszke et al. 2019] and TensorFlow [Abadi et al. 2015] produce results at discontinuous points, although the derivative is ill-defined at these points.

Our key lemma establishes that (1) the derivative of the denotation of a term equals the derivative denotation of the term (soundness) and (2) the derivative of a term is a parametric distribution that is the sum of a C⁰ simply decomposable function and Dirac deltas multiplied by C⁰ simply decomposable functions, as depicted in Figure 10.

Proof. We provide the full proof in Appendix I. □

We now characterize the semantic domain of the derivatives of well-typed terms.
Theorem 7.1. For every well-typed term Δ; Γ ⊢ e : real, if ⟦e⟧ is locally integrable and the zero level-sets of all diffeomorphisms in the derivative are mutually transverse, then ⟦de⟧(σ, ρ, ρ′) is the sum of a C⁰ simple distribution and a finite sum of delta distributions multiplied by C⁰ simply decomposable functions. Concretely, using the isomorphism between variable names and ordered variables, ⟦de⟧(σ, ρ, ρ′) can be written as f · 1 plus a finite sum of terms hᵢⱼ · (δ ∘ Ψᵢⱼ), where f and all hᵢⱼ are C⁰ simply decomposable and 1 is the lifting of the one function 1(x⃗) into a distribution (Definition 3.7). «

Proof. It suffices to prove the theorem for the denotation of the term (the preceding lemmas cover the other cases). Theorem 6.1 gives us that ⟦e⟧ is C¹ simply decomposable. We can then take the derivative of each of the summands gᵢ(x⃗, p⃗) Πⱼ [Φᵢⱼ(x⃗, p⃗) ≥ 0] from the definition of simply decomposable. Applying the product rule, the derivative is a sum over finite index sets in which each term differentiates one bracket, and the remaining factors Πⱼ≠ₖ [Φᵢⱼ(x⃗, p⃗) ≥ 0] times gᵢ are C⁰ simply decomposable. We can then use linearity and collect the coefficients, which are simply decomposable. □

DENOTATIONAL SEMANTICS OF PROGRAMS AND THEIR DERIVATIVES
Langur programs are integrals, and derivatives of integrals, of Langur terms. We discuss integrability and prove a soundness theorem showing that the derivative denotation of a program equals the derivative of the denotation of that program.

Type System
Figure 11 depicts the typing rules for Langur programs. The syntax int e d(x₁, . . ., xₙ) introduces the variables x₁, . . ., xₙ into the typing context for e. A well-typed program integrates over a well-typed term with no free variables, preventing nested integration.

Denotational Semantics
Figure 12 depicts the denotational semantics for Langur programs and their derivatives. The integral primitive denotes n-dimensional Lebesgue integration. We use the notation d(x₁, . . ., xₙ) to introduce the variables x₁, . . ., xₙ in the integrand. The denotation of the derivative of an integral is the integral of the distributional derivative of the integrand.
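Concretely, for F(t) = ∫₀¹ g(x)·[x ≤ t] dx, the distributional derivative of the integrand contributes g(x)·δ(x − t), which integrates to g(t) for t ∈ (0, 1). A minimal sketch (the function names are ours, not Langur syntax), checked against a finite difference of the integral:

```python
def integral(t: float, g, n: int = 200_000) -> float:
    # Midpoint-rule estimate of F(t) = integral from 0 to 1 of g(x)*[x <= t] dx
    total = 0.0
    for i in range(n):
        x = (i + 0.5) / n
        if x <= t:
            total += g(x)
    return total / n

def d_integral(t: float, g) -> float:
    # dF/dt = integral of g(x)*delta(x - t) dx = g(t) for t in (0, 1).
    # A derivative that ignores the discontinuity would incorrectly return 0.
    return g(t) if 0.0 < t < 1.0 else 0.0

g = lambda x: x
fd = (integral(0.5 + 1e-3, g) - integral(0.5 - 1e-3, g)) / 2e-3
```

Here fd agrees with d_integral(0.5, g) = 0.5, while differentiating the discretized sum term by term would give 0, as in Figure 1.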

Results
We now briefly discuss integrability and the correctness of the derivative. Recall from Section 2 that a program in the surface language has a compact domain of integration specified by the constant bounds of integration. Its encoding in the core language results in an integrand with compact support.
For example, the surface-language program integral([-10, 10]) x dx has a compact domain of integration [−10, 10]. In the core language, this program can be encoded as int (x · [−10 ≤ x] · [x ≤ 10]) d(x), where the integrand has compact support [−10, 10].
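One way to realize the compact-support encoding is to fold the bounds into Iverson brackets; the sketch below is our illustration of the idea, not necessarily Langur's exact output:

```python
def integrand(x: float) -> float:
    # Core-language-style encoding of integral([-10, 10]) x dx:
    # the bounds become bracket factors, giving the integrand
    # compact support [-10, 10] over an unbounded domain.
    return x * (1.0 if -10.0 <= x else 0.0) * (1.0 if x <= 10.0 else 0.0)
```

Inside the bounds the integrand is the identity (integrand(5.0) = 5.0); outside, both bracket factors force it to 0 (integrand(11.0) = 0.0).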
As a result of this property, we can show that programs implemented in the surface language have a well-defined denotation in the core language.

Proof. Theorem 7.1 shows that the integrand is a C⁰ simple distribution plus a finite sum of Dirac deltas scaled by C⁰ simply decomposable functions, each of which has compact support. The sum of the Dirac delta terms is bounded because the integrand is bounded. Since the integral of a continuous function with compact support is bounded, and there are finitely many continuous pieces, the result is bounded. □

Derivative Correctness. The derivative of the denotation equals the derivative denotation:

Theorem 8.1 (Derivative correctness). Let (ρ, ρ′) ∈ ⟦dΓ⟧ be given. For every well-typed program Δ; Γ ⊢ int e d(x₁, . . ., xₙ) : real, the denotation and derivative commute up to almost-everywhere equivalence, assuming that the program and its derivative ⟦d(int e d(x₁, . . ., xₙ))⟧(•, ρ, ρ′) are well-defined and that the zero level-sets of all diffeomorphisms in the derivative are mutually transverse. «

Proof. We provide the full proof in Appendix J. □

SEPARATE COMPILATION
In this section, we discuss an interpreter for the core language Langur, formalize separate compilation, and relate the surface language Potto to Langur. The operational semantics for programs follows from the denotational semantics (Appendix D). We now formalize the relationship between the two.
Operationally, we can commute the derivative and the integral (Figure 12), where the integrand denotes a parametric distribution and the final step applies the isomorphism between mappings from variable or parameter names and ordered tuples of real numbers. The mechanism of building separate compilation from compositional operational semantics is well studied [Cardelli 1997]. Separate compilation is important in programming, and specifically in computer graphics, where many small changes are made to program parts in the design process. Langur separately compiles program parts and simply evaluates them at different points to produce results, rather than requiring global program manipulations as in Teg [Bangaru et al. 2021].
Example. We now discuss the Potto program from Section 2 (code blocks 1, 2, 3) and briefly describe its conversion from Potto to Langur. Recall from the type system that variables and parameters are passed in the environment.
The call risk(distro, a, b, mu) becomes the integral on the last line because we have applied the function trunc_normal. We take the liberty of using the names normal, mu, and trunc_normal for parameters, rather than p₁, p₂, and p₃, for clarity.
We need a little more machinery to introduce risk and thereby gain the benefits of separate compilation. We introduce a finite looping construct that is syntactic sugar and is unrolled. Using the same argument as for let bindings, we can separately compile the loop body and each iteration.
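The unrolling can be sketched as a source-to-source desugaring: the loop body is compiled once and the iterations chain through let bindings. The sketch below is illustrative, not Langur's concrete syntax:

```python
def unroll(n: int, body, init):
    # Desugar `acc = init; repeat n times: acc = body(i, acc)` into a chain
    # of bindings. The body is compiled once and reused at every iteration,
    # so the loop itself needs no new semantic machinery.
    acc = init
    for i in range(n):
        acc = body(i, acc)
    return acc
```

For example, unroll(4, lambda i, acc: acc + i * i, 0) evaluates the unrolled chain 0² + 1² + 2² + 3² = 14.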

EVALUATION
In this section, we evaluate the prototype programming language Potto and implement a differentiable renderer with swappable shaders, meaning that each shader can be separately compiled and therefore efficiently interchanged. We describe implementation details and optimizations on top of those presented in the operational semantics in Appendix G. We compare Potto and Teg on a renderer from Teg and on two synthetic examples [Bangaru et al. 2020].

A Differentiable Renderer with Swappable Shaders
We implement a differentiable renderer [de La Gorce et al. 2011; Li et al. 2018, 2020; Loper and Black 2014; Zhao et al. 2020]. Domains ranging from autonomous driving to robotics to computer-generated imagery use differentiable renderers to recognize the 3D shapes of cars, signs, and pedestrians, to reconstruct a 3D scene for a robot, and to capture the 3D properties of actors' faces.
A renderer is a program that takes in the geometry and color of each object in the scene and outputs an image. Renderers are built out of programs called shaders. A differentiable renderer computes the change in the color of a pixel with respect to changes in parameters, such as the location or color of an object. These derivatives are useful for optimization.
Figure 13a depicts the scene: a line of light shining diagonally downward. The color of a single pixel is the average of the light within the pixel area. We seed the derivative with dc = 1, producing a number representing the derivative. The program then does a step of gradient descent to find the parameter value resulting in a pixel most similar to the pixel in a given image.
Separate Compilation. Potto separately compiles the derivative of the renderer and the shaders, making it possible to compile different program parts to different hardware (e.g., the CPU and GPU). As a result, Potto compiles the derivative of the renderer and each shader once, rather than needing to compile the derivative of the renderer and shaders three times. Since calculating the derivative of the renderer is time-consuming, this results in a 2.78x speedup.
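The workflow can be sketched as compiling each part once and composing the compiled artifacts. The compile_part helper and the counter below are ours, standing in for the real (expensive) derivative compilation:

```python
# Sketch of separate compilation: compile the renderer derivative once,
# then compile and swap in each shader independently.
compile_count = {"renderer": 0, "shader": 0}

def compile_part(kind: str, source: str):
    # Stand-in for a real, expensive derivative-compilation step
    compile_count[kind] += 1
    return ("compiled", kind, source)

renderer = compile_part("renderer", "d_renderer")
for shader_src in ["z_depth", "toon_lambert", "normal_map"]:
    shader = compile_part("shader", shader_src)
    # ...evaluate the composed (renderer, shader) pair here...
```

After the loop, the renderer derivative has been compiled once while three shaders were compiled, mirroring the compile-once structure described above.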

A Ray-Tracing Differentiable Renderer with Swappable Shaders
A ray tracer is a renderer that computes the color of each pixel in the image by shooting rays from a camera to identify which objects are visible and what color they have. The ray tracer shoots rays into the scene and computes the color at the intersection of the ray with the geometry [Shirley 2020]. The subset of parameters that control the color, reflectiveness, and other properties of surfaces reside in the shader programs. Changing shader parameters changes the stylization of the scene.
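A minimal ray-tracing loop makes the division of labor explicit: the renderer owns ray generation and intersection, while color is delegated to a swappable shader. The helpers below are hypothetical, for illustration only:

```python
def render(width, height, intersect, shader):
    # intersect(ray) returns a hit (or None); shader(hit) returns a color.
    # The renderer never inspects the shader, so shaders swap freely.
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            ray = ((px + 0.5) / width, (py + 0.5) / height)  # toy camera
            hit = intersect(ray)
            row.append(shader(hit) if hit is not None else (0, 0, 0))
        image.append(row)
    return image

# Toy scene: "geometry" occupies the left half of the screen
img = render(4, 2,
             intersect=lambda ray: ray if ray[0] < 0.5 else None,
             shader=lambda hit: (255, 0, 0))
```

The left pixels receive the shader's red; the right pixels miss the geometry and stay black.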
Differentiable rendering [Li et al. 2018; Loper and Black 2014] is useful for solving inverse problems. For example, an artist might draw a 2D image and want a 3D scene that, when viewed from a particular angle, looks as similar to their drawing as possible. Gradient-based optimization can automatically set these shader parameters.
A production renderer has many shaders that artists can select to achieve a desired effect [Perlin 1985]. Rendering often involves large geometries within the scene, making it important to be able to quickly swap out shaders without recompiling the whole rendering engine.
As a result, engines typically allow for separate compilation of the renderer from the component shaders. However, existing boundary-sampling techniques are either limited to affine discontinuities [Li et al. 2018] or do not support separate compilation [Bangaru et al. 2021]. To solve inverse problems with multiple shaders without incurring this overhead, it is imperative to be able to swap out shaders and their derivatives in the derivative of a renderer.
Each pixel in the rendered image is a rectangular cell in an imaginary screen in front of the camera. Its color is the average (integral) of the light rays that pass through the cell.
Swappable Shaders. We would like to swap between different discontinuous shaders, often called toon shaders based on their cartoon-stylized look. Often these come from standard physical shaders that are thresholded. Our example shaders are 1) a z-depth shader, which gives different colors to objects based on how far they are from the camera, and 2) a thresholded Lambert shader, which models the reflectance of a matte surface under a point light that sits in front of the triangle and has linear intensity falloff [Lake et al. 2000].
Figure 14 depicts images produced by a Potto implementation of a differentiable ray tracer. The derivative of Figure 14a is Figure 14b, and the derivative of Figure 14c is Figure 14d. The scene consists of a single triangle, where the upper-left corner is closer to the camera than the others. We differentiate with respect to the vertex positions of the triangle as it expands in space, as well as a threshold parameter in the shaders. The upper-left corner comes towards the camera, while the other corners spread outward.
The z-depth shader has a discontinuity at a fixed depth in space. The triangle is colored red if it is close enough to the camera, and green otherwise. As the triangle expands, more black pixels along its border become colored and more of the triangle becomes red.
The thresholded Lambert shader has a discontinuity on the triangle surface based on the amount of light received from the point light source. When the light value is less than a certain threshold, the triangle is colored red. As the triangle expands, the transition curve moves around on the triangle.
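Both shaders can be sketched as thresholded versions of standard quantities. The exact inputs and color constants below are our assumptions, chosen to match the description above:

```python
RED, GREEN = (255, 0, 0), (0, 255, 0)

def z_depth_shader(depth: float, threshold: float):
    # Red if the surface point is close enough to the camera, else green.
    # The discontinuity sits at a fixed depth in space (depth == threshold).
    return RED if depth <= threshold else GREEN

def toon_lambert_shader(n_dot_l: float, falloff: float, threshold: float):
    # Thresholded Lambert: the diffuse term max(n.l, 0) scaled by the
    # light's linear falloff, then quantized to get the cartoon look.
    intensity = max(n_dot_l, 0.0) * falloff
    return RED if intensity < threshold else GREEN
```

Each shader is an ordinary function of per-hit quantities, so either one can be handed to the renderer loop and swapped without recompiling it.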

Comparison to Teg
We compare to Teg [Bangaru et al. 2021] on an image stylization task and two microbenchmarks.
Methodology. Teg provides three backends: a vectorized Python implementation using NumPy, a C backend, and a CUDA backend. We use the vectorized NumPy backend because it most closely matches our implementation. We also observed overhead from starting the C compiler that made evaluation in the generated C code slower than in NumPy on these benchmarks. We run all evaluations on a desktop with an Intel i9-10900K 10-core CPU and 64GB of memory.

10.3.1 Renderer for Image Stylization. Bangaru et al. [2021] present a renderer for image stylization in Figures 5 and 6. Figure 15 shows the (derivative) compilation time, evaluation time, total time, and AST size for linear and quadratic shader stylization.
Teg creates duplicate expressions during compilation, leading to increased compile time, evaluation time, total time, and code size. Neither Teg nor Potto performs common sub-expression elimination, so the evaluation time increases. Separate compilation in Potto results in smaller ASTs and faster compilation and evaluation.

10.3.2 Microbenchmarks. We design a set of small benchmarks to compare Potto and Teg in terms of compile time (to calculate the derivative), evaluation time, total time (both compilation and evaluation), and code size. We report the geometric mean across runs, with 10 runs for the first benchmark and 3 for the second due to time constraints.
Increasing the Number of Parametric Discontinuities. We study how the performance of Potto and Teg scales as we increase the number of parametric discontinuities to be differentiated. The program takes a function f and a number n representing the number of Heavisides and produces an expression evaluating the sum of Heavisides multiplied by the function. We bind f to the first-order function programs a+x and a-x. The number n controls the number of Heavisides and therefore the number of Dirac deltas in the derivative.

Separate Compilation. We study the impact of separate compilation by building an expression with 90 parametric discontinuities representing the geometry of a scene and scaling the number of times n that we swap between shaders. Figure 17 depicts the trends as the number of shader swaps increases. Teg must recompile the whole expression every time we swap the shader, leading to a slow compile time that reaches our timeout of 20 minutes at 15 swaps. In contrast, since Potto can separately compile the two modules, the compilation cost increases slowly: Potto needs about a minute at 50 swaps.
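The first microbenchmark's expression can be sketched as below. The exact bracket thresholds are our assumption; the shape (n Heavisides, each a parametric discontinuity in a, multiplying f) follows the description, with f bound to a + x at a = 1:

```python
def heaviside_sum(f, n: int, x: float, a: float) -> float:
    # Sum over i = 1..n of [x <= a*i/n] * f(x): n parametric discontinuities
    # in a, so the derivative with respect to a contains n Dirac deltas.
    return sum((1.0 if x <= a * i / n else 0.0) * f(x)
               for i in range(1, n + 1))

# With f(x) = a + x bound at a = 1, at x = 0.55 only the brackets with
# i/n >= 0.55 fire: five of ten, each contributing f(0.55) = 1.55.
val = heaviside_sum(lambda x: 1.0 + x, 10, 0.55, 1.0)
```

Growing n grows the number of delta terms in the derivative linearly, which is exactly the axis the benchmark scales.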

RELATED WORK
We extend distribution theory to differentiate under integrals. Our implementation draws on concepts from differentiable and probabilistic programming. For applications, we build on a body of research in differentiable rendering and probabilistic inference.

Distribution Theory
Distribution theory was formalized by Schwartz [1950]. We adapt the theory to address the product-of-distributions problem [Schwartz 1954]. Recent programming languages work uses distribution theory to (equationally) reason about derivatives of discontinuous programs in a higher-order language [Azevedo de Amorim and Lam 2022]. However, they do not directly extend the theory, which poses a challenge for reasoning about nested conditionals, preventing applications as simple as rendering a triangle. Their lack of diffeomorphic conditionals means that their implementation is similarly restricted.

Sampling Along Discontinuities
Early work hand-coded samplers for discontinuities [Li et al. 2018]. Concurrent work automatically handled affine discontinuities, with applications to stochastic variational inference [Lee et al. 2018]. Recent work also differentiated parametric discontinuities using an integration primitive [Bangaru et al. 2021]. That paper claims to support higher-order derivatives, but this is not the case (diffeomorphisms are often not preserved under differentiation). We provide a formal foundation that encompasses their work and supports additional language features useful in graphics (e.g., first-order functions and separate compilation).

Smoothing Out Discontinuities
Rather than directly sampling discontinuities, one can smooth them out [Chaudhuri and Solar-Lezama 2010; Laine et al. 2020; Liu et al. 2019]. This typically leads to approximate results and requires additional parameters [Inala et al. 2018; Yang et al. 2022]. We focus on consistent, unbiased estimation of derivatives, but are interested in the practical trade-offs between these approaches. Recent work showed that unbiased estimation is possible for smoothed-out discontinuities, but the work is yet to be systematized [Bangaru et al. 2020].
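For contrast with the delta-based approach, a common smoothing is a sigmoid relaxation of the Iverson bracket; the τ below is the extra parameter such methods require, and any fixed τ biases both the value and its gradient:

```python
import math

def smooth_bracket(x: float, tau: float) -> float:
    # Sigmoid relaxation of [x >= 0]. As tau -> 0 it approaches the bracket,
    # but for any fixed tau the value and gradient are biased approximations.
    return 1.0 / (1.0 + math.exp(-x / tau))
```

At x = 0 the relaxation returns 0.5 regardless of τ, illustrating the approximation error at the discontinuity itself.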
Automatic Differentiation. Automatic differentiation has been a known technique for quite some time [Wengert 1964]. Interest in developing efficient AD systems has grown significantly [Abadi et al. 2015; Bradbury et al. 2018; Paszke et al. 2019; Pearlmutter and Siskind 2008; Yu et al. 2014]. Researchers have also developed theoretical techniques for proving correctness [Abadi and Plotkin 2019; Elliott 2018; Lee et al. 2020; Mazza and Pagani 2021; Sherman et al. 2021]. Sherman et al. [2021] study a language with derivatives and integrals, but it returns vacuous results when differentiating parametric discontinuities.
Probabilistic Programming. Saheb-Djahromi [1978] and Kozen [1981] performed early work on the semantics of probabilistic programs. Recent work develops efficient probabilistic programming languages [Bingham et al. 2019; Cusumano-Towner et al. 2019; Stan Development Team 2015]. Compositionality is also of interest in these languages. For example, Cusumano-Towner et al. [2019] introduce a generative function interface that is compositional and should similarly support separate compilation. We believe distributional semantics are also relevant to other inference tasks [Gehr et al. 2020; Lew et al. 2019; Shan and Ramsey 2017; Zhou et al. 2019].
Recent work, ADEV, provides a framework for reasoning about the composition of gradient estimators, along with proofs of correctness [Lew et al. 2023]. It lacks support for differentiating parametric discontinuities, as discussed in the subsection "AD of Languages with Integration" of their related work. Our paper provides such a sound gradient estimator, which could potentially compose with other estimators given an appropriate extension of their framework. This would require changing their type system to admit programs with parametric discontinuities beyond discrete distributions that can either be enumerated or estimated via REINFORCE.
12 CONCLUSIONS
Parametric discontinuities arise in applications spanning computer graphics [Li et al. 2018], robotics [Hu et al. 2020], and probabilistic programming [Lee et al. 2018]. We design a theory providing a semantic model for differentiating the parametric discontinuities that arise in these applications. Using the insights from our theory, we implement a system that can separately compile programs, allowing us to build a differentiable renderer with swappable shaders. In the future, we hope that differentiable programming languages will support differentiation of parametric discontinuities to better serve application domains.

Fig. 1. Differentiate Before You Discretize. (a) The integral (shaded) ∫₀¹ [x ≤ t] dx at t = 0.5 is 0.5. (b) shows a discretization of the integral, as in an implementation in a language without integration. (c) shows the derivative of the integral as computed by Potto, which integrates a Dirac delta spike at t and returns 1. (d) depicts that standard AD differentiates the discretized program and incorrectly returns 0.
Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 126. Publication date: April 2024.
Fig. 2. At initialization (a), the truncated normal density is far from the normal density. The descent direction in standard AD (b) is correct for the mean, but is 0 and therefore incorrect for the truncation points a and b. The black dots in (b) and (d) are shared samples that standard AD and Potto use to estimate the risk. Standard AD (c) optimizes the mean but fails to optimize a and b. Potto (d) samples at a and b (orange stars) to account for the parametric discontinuities and has the same descent direction for the mean as standard AD. Potto converges (e) to the desired curve by moving the mean right and widening the truncation points.

Fig. 3. The first two images show that standard AD (blue) has zero gradient for the truncation points, so they remain at their initialization a = −1 and b = 4. Potto (orange) accounts for the parametric discontinuities, optimizing the truncation points to their optimal values a = −10 and b = 10. The third image shows that the mean converges to the optimum μ = 2 about 10x faster for Potto than for standard AD. The fourth image shows that the loss for Potto is orders of magnitude lower than for standard AD.

Fig. 4. Potto supports separate compilation, which enables the reuse of the derivative of risk.
Fig. 6. The type system specifying well-formed terms.
Fig. 10. A well-typed term denotes a C⁰ simply decomposable function (left). Its derivative (right) is a sum of a simple distribution (constant interior color) and Dirac deltas multiplied by C⁰ simply decomposable functions (the boundary lines).
Fig. 11. The type system specifying well-typed programs.
Fig. 13. We use a differentiable renderer implemented in Potto to turn scenes containing a single line of light pointing diagonally downward with different light attenuations into images. Swapping between different shaders is a critical part of the design process. In Potto, programs and their derivatives can be separately compiled and composed to efficiently swap among shaders.

Fig. 14. Discontinuous shaders at a 40 × 40 resolution. We display the sparsity pattern of the derivative shaders. For example, the derivative shader is red if the derivative is nonzero in the red channel.

Fig. 15. A bar chart, where smaller is better, comparing Potto to Teg on an image stylization task. Potto is so much faster in compile time that its bars are not visible. Compile time was the bottleneck in Teg.
Fig. 17. A comparison of how Potto and Teg scale with the number of swaps between shaders.

Figure 16 depicts the trends as the number of Dirac deltas increases. Potto compiles code faster than Teg, a result of the global program transformations needed in Teg. Potto has slower runtime performance than Teg, due to Potto doing dynamic tracing and being written in pure Python. Potto takes less total time to compute the result. Potto and Teg have similar AST sizes that grow linearly in the number of deltas.