Turaco: Complexity-Guided Data Sampling for Training Neural Surrogates of Programs

Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges. Programmers train surrogates from measurements of the behavior of a program on a dataset of input examples. A key challenge of surrogate construction is determining what training data to use to train a surrogate of a given program. We present a methodology for sampling datasets to train neural-network-based surrogates of programs. We first characterize the proportion of data to sample from each region of a program's input space (corresponding to different execution paths of the program) based on the complexity of learning a surrogate of the corresponding execution path. We next provide a program analysis to determine the complexity of different paths in a program. We evaluate these results on a range of real-world programs, demonstrating that complexity-guided sampling results in empirical improvements in accuracy.


INTRODUCTION
Surrogate programming, the act of replacing a program with a surrogate model of its behavior, has been increasingly leveraged to solve a variety of software development challenges [Renda et al. 2021]. For example, Esmaeilzadeh et al. [2012] train neural networks to mimic existing numerical programs (a fast Fourier transform, a triangle intersection detection algorithm, a JPEG encoder, etc.) and execute these neural networks on specialized hardware, achieving an average speedup of 2.3× across benchmarks at the cost of 10% accuracy in these numerical programs. Kustowski et al. [2020] train a neural network to mimic computer simulations of inertial confinement fusion, then fine-tune the neural network on a small amount of data from real experiments to improve the accuracy of the simulated predictions. Renda et al. [2020] train a neural network to mimic a CPU simulator, then exploit the fact that this neural surrogate is differentiable to optimize the CPU simulator's simulation parameters with gradient descent.

Surrogate Programming
Beyond these examples, surrogate programming has developed into a diverse set of techniques and applications across many areas of computing and science. Renda et al. [2021] organize the uses of neural-network-based surrogates of programs into three categories.
Surrogate compilation. In surrogate compilation, programmers develop a surrogate that replicates the behavior of a program to deploy to end-users in place of that program. For example, in addition to Esmaeilzadeh et al. [2012] (who use a surrogate to speed up numerical programs), Mendis [2020] use a surrogate to speed up compiler autovectorization by replacing an integer linear program (ILP) solver with a surrogate. Key benefits of surrogate compilation include the ability to execute the surrogate on different hardware and the ability to bound or to accelerate the execution time of the surrogate.
Surrogate adaptation. In surrogate adaptation, programmers first develop a surrogate of a program and then further train that surrogate on data from a different task. For example, in addition to Kustowski et al. [2020] (who use a surrogate to improve the accuracy of inertial confinement fusion simulations), Tercan et al. [2018] use this technique to improve the accuracy of computer simulations of plastic injection molding. Key benefits of surrogate adaptation include that it makes it possible to alter the semantics of the program to perform a different task of interest and that it may be more data-efficient or result in higher accuracy than training a model from scratch for the task.
Surrogate optimization. In surrogate optimization, programmers develop a surrogate of a program, use gradient descent against the surrogate to identify program inputs that optimize a downstream objective, then use those inputs when executing the original program. In addition to Renda et al. [2020] (who use this technique to optimize CPU simulation parameters), Tseng et al. [2019] use this technique to find parameters for camera pipelines that lead to the most photorealistic images. She et al. [2019] use this technique to find inputs that trigger branches that may cause bugs in the original program.
The key benefit of surrogate optimization is that it can identify optimal inputs faster than optimizing directly against the program, due both to the surrogate's faster execution speed and to the fact that the surrogate is differentiable even when the original program is not (allowing the use of gradient descent).

Dataset Generation
In each of these scenarios, training a surrogate of a program requires measuring the behavior of the program on a dataset of input examples. There are three common approaches to collecting this dataset. The first is to use data instrumented from running the original program on a workload of interest [Esmaeilzadeh et al. 2012; Renda et al. 2020]. In the absence of an available workload, another is to uniformly sample (or sample using another manually defined distribution) from the input space of the program [Kustowski et al. 2020; Tseng et al. 2019]. The third is to use active learning [Settles 2009], a class of online methods that iteratively query labels for the data points that are most useful (however defined) for further training the surrogate [İpek et al. 2006; Pestourie et al. 2020; She et al. 2019].
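As a minimal sketch of the first two collection strategies, a dataset can be built either by instrumenting a workload or by uniform sampling. The input dimension and the [−1, 1] range here are illustrative assumptions, not requirements of any particular benchmark:

```python
import random

def collect_dataset(program, n, workload=None, input_dim=2):
    """Label inputs with the program's observed outputs.

    If a workload of representative inputs is available, instrument it;
    otherwise fall back to uniform sampling over [-1, 1]^input_dim
    (the range and dimension are illustrative).
    """
    if workload is not None:
        xs = workload[:n]
    else:
        xs = [tuple(random.uniform(-1.0, 1.0) for _ in range(input_dim))
              for _ in range(n)]
    return [(x, program(*x)) for x in xs]
```

Active learning, by contrast, would interleave this labeling step with training, choosing each new query based on the current surrogate.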
Each of these approaches faces challenges on programs with different behaviors in different regions of the input space. For example, Renda et al. [2020, Section IV.A] identify a scenario in which an instrumented dataset does not exercise a set of control flow paths in the program enough times for the surrogate to learn the program's behavior along those paths, resulting in a surrogate that generates highly inaccurate predictions for inputs in the regions of the input space corresponding to those paths.

Our Approach: Complexity-Guided Sampling
Rather than treating the program as a black box, our approach uses the source code and semantics of the program under study to guide dataset generation for training a surrogate of the program. The core concept is to allocate samples based on both the complexity of learning the program's behavior on a given path and the frequency of that path in the input data distribution.
Complexity-guided sampling. Our objective is to find how many samples to allocate to each region of the input space to minimize the expected error of the resulting surrogate. To reason about the error of a surrogate, we use neural network sample complexity bounds for learning analytic functions [Agarwala et al. 2021;Arora et al. 2019]. These bounds give an upper bound on how many samples are required to learn a surrogate of an analytic function to a given error as a function of a complexity measure of that function. Our approach calculates a complexity measure for the function induced by each control flow path in the program and combines that with the frequency of each path according to an input data distribution. The output of the approach is the proportion of samples to allocate to each region of the input space, minimizing an upper bound on the stratified surrogate's error.
Stratified functions. Our core modeling assumption is to represent the program as a stratified function, a piecewise function across different regions (strata) of the input space. We use stratified surrogates to model such functions. To construct a stratified surrogate, we train an independent surrogate of each component of the stratified function. During evaluation, a stratified surrogate uses the original program to check which stratum an input is in, then applies the corresponding surrogate.
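The evaluation-time dispatch can be sketched as follows; `stratum_of` and the toy per-stratum models are illustrative stand-ins for instrumenting the original program's control flow and for trained neural surrogates:

```python
def make_stratified_surrogate(stratum_of, surrogates):
    """Evaluate a stratified surrogate: use the original program's control
    flow (stratum_of, an assumed callable) to pick the stratum, then apply
    that stratum's trained surrogate."""
    def stratified(x):
        return surrogates[stratum_of(x)](x)
    return stratified

# toy usage: two strata with deliberately simple stand-in "surrogates"
f_hat = make_stratified_surrogate(
    lambda x: 0 if x < 0.0 else 1,
    [lambda x: 0.0, lambda x: x * x],
)
```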
Complexity analysis. We present a programming language, Turaco, in which programs denote stratified functions with well-defined complexity measures (specifically, stratified analytic functions). We provide a static program analysis for Turaco programs that automatically calculates an upper bound on the complexity of each component of the stratified function that the program denotes.
Evaluation. To demonstrate that complexity-guided sampling using our complexity analysis improves surrogate error on downstream tasks, we evaluate our approach on a range of programs, finding that across this selection of programs complexity-guided sampling improves error relative to baseline distributions by around 5%. We demonstrate that a 5% improvement in error of a surrogate can result in a 28% improvement in execution speed in an application with a maximum error threshold. We then analyze the classes of problems for which complexity-guided sampling excels, finding potential improvements in error of up to 30%, and the classes of problems for which complexity-guided sampling using Turaco's complexity analysis fails, finding deteriorations in error of up to 500%.
Renderer demonstration. We further present a case study of learning a surrogate of a renderer in a video game engine. We show that our complexity-guided sampling approach results in between 17% and 44% lower error than training using baseline distributions that do not take into account path complexity. These error improvements correspond with perceptual improvements in the generated renders.
Contributions. We present the following contributions:
• An approach to allocating samples among strata to train stratified neural network surrogates of stratified analytic functions that minimizes an upper bound on the surrogate's error.
• A programming language, Turaco, in which all programs are stratified analytic functions, and a program analysis to bound the complexity of learning surrogates of those programs.
• Empirical evaluations on real-world programs demonstrating that complexity-guided sampling using Turaco's complexity analysis results in empirical improvements in error, and that these improvements in error result in improvements in downstream applications.
• Further empirical evaluations of the classes of problems where complexity-guided sampling using Turaco's complexity analysis succeeds and fails.
We lay the groundwork for analyzing complexity-guided sampling approaches for training surrogates of programs. Our results hold out the promise of surrogate training approaches that intelligently use the program's semantics to guide the design and training of surrogates of programs.
Figure 1a presents an example distilled from our evaluation (Section 5.2) that we use to demonstrate how complexity-guided sampling results in a more accurate surrogate than frequency-based sampling, sampling according to the frequency of paths alone.

EXAMPLE
Program under study. Figure 1a presents a graphics program that calculates the luminance (i.e., brightness) at a point in a scene as a function of sunPosition, the height of the sun in the sky (i.e., the time of day), and emission, which describes how reflective the material is at that point.
The program first checks whether it is nighttime (Line 2), and sets the ambient lighting variable to zero accordingly. The program next checks whether the sun position is below a threshold indicating direct sunlight (Line 7) and sets the emission variable accordingly. The output is then the sum of the ambient light and the light emitted by the material. Figure 1b presents the output of this program over the valid input range of sunPosition and emission (i.e., between −1 and 1 for both variables).
The path conditions (Lines 2 and 7) partition the program into three traces: nighttime, when sunPosition is less than 0 (Figure 1c); twilight, when sunPosition is between 0 and 0.1 (Figure 1d); and daytime, when sunPosition is greater than 0.1 (Figure 1e). These paths are separated by dashed black lines in Figure 1b.
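Since Figure 1a is not reproduced here, the following sketch is a hypothetical reconstruction of the program's structure as described above; the exact arithmetic is illustrative, but the three traces split at the thresholds 0 and 0.1, with the output linear below 0.1 and quadratic above it:

```python
def luminance(sun_position, emission):
    # nighttime: no ambient light; otherwise ambient grows with sun height
    if sun_position < 0.0:
        ambient = 0.0
    else:
        ambient = sun_position
    # below the direct-sunlight threshold the emission term stays linear;
    # above it the output becomes quadratic in the inputs
    if sun_position < 0.1:
        emitted = emission
    else:
        emitted = emission * emission
    return ambient + emitted
```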
Complexity. Training a surrogate of this program poses a particular challenge because these traces have not only different behavior but also different relative complexities: when sunPosition is less than 0.1 the function is linear, but when sunPosition is above 0.1 the function is quadratic. This notion of complexity is quantified by the sample complexity of each trace: traces that are more complex require more samples to learn to a given error than traces that are less complex. Figure 2a presents the error as a function of training dataset size for surrogates of each trace trained in isolation, showing that indeed the quadratic daytime path has the highest error, followed by twilight and then nighttime.
Complexity-guided sampling. Our objective is to find the number of data points to sample from each path to minimize the expectation of error of a surrogate of the overall program, given a data distribution and a data budget. To accomplish this, our approach leverages the complexity of each path and the frequency of each path in the data distribution, prioritizing sampling paths that are more complex (requiring more samples to learn) and that are more frequent (and thus more important to learn).
First, our approach determines the sample complexity of the trace along each path: the number of samples required to learn a surrogate of the trace (in isolation) to a given error. Our approach extends the sample complexity results of Agarwala et al. [2021], who give an upper bound on the number of samples required to learn a neural network approximation of a given analytic function. Using this bound (as implemented by our Turaco analysis described in Section 4), our approach determines that the twilight path requires 1.4× as many samples as the nighttime path to train a surrogate to a given error, and the daytime path requires 3.7× as many samples.
Then, given a distribution with the frequency of each path, our approach determines the complexity-guided sampling rates for each path. In this example we assume that the data has a uniform distribution over inputs between −1 and 1, resulting in path frequencies of 50% for the nighttime path (sunPosition < 0), 10% for the twilight path (0 < sunPosition < 0.1), and 40% for the daytime path (0.1 < sunPosition). With this, our approach determines that the nighttime path should be sampled at 36.9% of the data budget (undersampling relative to its frequency because it is simple to learn), the twilight path at 14.0%, and the daytime path at 49.1% (oversampling relative to its frequency because it is complex to learn).
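These rates are consistent with allocating samples in proportion to (p_i · √C_i)^(2/3), which is what falls out of minimizing Σ_i p_i · √(C_i / n_i) under a fixed total budget. A sketch, with the failure-probability term folded into the relative complexities 1, 1.4, and 3.7 reported above:

```python
def allocation(freqs, complexities):
    # Minimizing sum_i p_i * sqrt(C_i / n_i) subject to sum_i n_i = n
    # yields n_i proportional to (p_i * sqrt(C_i)) ** (2/3).
    w = [(p * c ** 0.5) ** (2.0 / 3.0) for p, c in zip(freqs, complexities)]
    total = sum(w)
    return [x / total for x in w]

rates = allocation([0.5, 0.1, 0.4], [1.0, 1.4, 3.7])
# rates ≈ [0.37, 0.14, 0.49]: nighttime under-sampled and daytime
# over-sampled relative to their frequencies
```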
Stratified surrogates. The class of surrogate model for which we derive the above approach is the stratified neural surrogate: a set of disjoint neural networks, applied based on which path the input induces in the program. Concretely, this means that we train one surrogate per path and pick which to apply for each input at evaluation time. For this example program, picking which surrogate to apply requires only comparing sunPosition against constant threshold values.
Results. Figure 2b presents the error as a function of the training dataset size of stratified surrogates of the entire program for a baseline of sampling according to path frequency alone and for complexity-guided sampling. Figure 2b shows that the complexity-guided sampling approach results in lower error than sampling according to path frequency alone. For datasets of total size below 70 samples, the surrogate trained with complexity-guided sampling has a geometric mean decrease in error of 27.5%. For datasets of total size above 70 samples, the surrogate trained with complexity-guided sampling has a geometric mean decrease in error of 5.5%. Across the entire range of dataset sizes evaluated in this plot, the surrogate trained with complexity-guided sampling has a geometric mean decrease in error of 15%. In sum, our approach results in a surrogate that produces a more accurate luminance calculation, and therefore a better final output from the graphics program, than a surrogate trained using frequency-based path sampling.

COMPLEXITY-GUIDED SAMPLING
In this section we present the stratified surrogate sample allocation problem and derive our solution, complexity-guided stratified surrogate dataset selection.

The Stratified Surrogate Sample Allocation Problem
Our goal is to learn a stratified surrogate f̂ of a stratified function f, constrained by a sample budget n that defines the number of data samples to be used by the learning algorithm.
We approach this problem through the definition of a stratified function as a piecewise function; we term each piece a stratum. We then define a stratified surrogate as a stratified function itself, with each stratum a surrogate of a corresponding stratum of the stratified function. Learning a stratified surrogate therefore requires learning a surrogate for each stratum.
We assume that we have a technique for learning a surrogate of a function f given a sample budget. Our precise goal is thus to partition the overall sample budget n into per-stratum sample budgets n_i for each stratum of the stratified function, with the objective of minimizing the overall error of the stratified surrogate.
3.1.1 Stratified Functions and Surrogates. We define a stratified function f as follows:

$$f(x) = f_i(x) \quad \text{if } x \in X_i$$

where each f_i is a function from inputs x : X to outputs y : Y, k is the number of strata, {X_i}_{i=1}^k are the strata, ∪_i X_i = X, and ∀i ≠ j. X_i ∩ X_j = ∅. We define a stratified surrogate f̂ as a stratified function with components f̂_i.
3.1.2 The Stratified Surrogate Sample Allocation Problem. To restate, our goal is to learn a stratified surrogate f̂ of a stratified function f.
Formally, we define a learning algorithm, a function that learns a surrogate of a given input function, as a random function tr : (X → Y) × D × N × (Y × Y → R≥0) → (X → Y) that takes a function f : X → Y from inputs x : X to outputs y : Y, a distribution D over inputs x, a number of training examples n : N, and a loss function ℓ : Y × Y → R≥0 that measures the cost of an incorrect prediction, and returns a function (the output surrogate) f̂ : X → Y.
We also define notation for data distributions D. Let D(x) be the probability that x is sampled from D, and ∫_{x∈X_i} D(x) dx be the probability mass of all data points within X_i under D (reducing to a summation for discrete distributions). Let D(x | X_i), the distribution of D within stratum X_i, be defined as:

$$D(x \mid X_i) = \frac{D(x)}{\int_{x' \in X_i} D(x')\,\mathrm{d}x'} \quad \text{for } x \in X_i \text{ (and } 0 \text{ otherwise)}$$

We next define a stratified learning algorithm. A stratified learning algorithm learns a stratified surrogate of a stratified function by learning each component surrogate independently (given their respective dataset budgets). We use the following notation to denote the operation of a stratified learning algorithm, where n⃗ is a vector of sample budgets for each stratum:

$$\hat{f} \sim \mathrm{tr}(f, D, \vec{n}, \ell) \quad \text{where each} \quad \hat{f}_i \sim \mathrm{tr}(f_i, D(\cdot \mid X_i), n_i, \ell)$$

We formalize stratified surrogate sample allocation with the following optimization problem:

$$\min_{\vec{n}} \; \mathbb{E}_{\hat{f} \sim \mathrm{tr}(f, D, \vec{n}, \ell)} \left[ \mathbb{E}_{x \sim D} \left[ \ell(\hat{f}(x), f(x)) \right] \right] \quad \text{subject to} \quad \textstyle\sum_i n_i \leq n \tag{1}$$

The objective of this problem is to find a vector of per-stratum sample budgets n⃗ that, in the expectation over the outcomes of the stratified surrogate learning algorithm (the outer expectation), minimizes the expected loss over the data distribution (the inner expectation), subject to a constraint that the total number of samples used is no more than n.
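Given only the ability to sample from D and to check stratum membership (both assumed callables here), a simple way to realize sampling from D(x | X_i) is rejection sampling:

```python
import random

def sample_conditional(sample_D, in_stratum, k, max_tries=1_000_000):
    """Draw k points from D(x | X_i): sample from D and keep only the
    draws that land in the stratum X_i."""
    out = []
    for _ in range(max_tries):
        if len(out) == k:
            break
        x = sample_D()
        if in_stratum(x):
            out.append(x)
    return out
```

Rejection sampling becomes slow for low-frequency strata; a stratum-aware generator would be preferable when one is available.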

Complexity-Guided Stratified Surrogate Dataset Selection
In this section, our goal is to solve Equation (1). To solve this optimization problem, we need to model the relationship between the sample budget afforded to the learning algorithm for each stratum and the error of the resulting surrogate. We leverage the PAC learning framework for neural networks to derive a conservative probabilistic upper bound on the error of the surrogate. We then solve this optimization problem with our derived upper bound in place of our original objective.

PAC Learning.
To reason about the error of a surrogate, we use the probably approximately correct (PAC) learning framework [Valiant 1984]. The PAC learning framework bounds the number of examples needed to learn a surrogate as a function of the allowable error threshold for the surrogate. Equation (2) defines a given function f as probably approximately correctly learnable [Valiant 1984] (abbreviated as learnable) for a given learning algorithm tr and loss function ℓ if for all distributions D, with probability 1 − δ the learning algorithm returns a surrogate f̂ that approximately matches the original function over the distribution (i.e., the expectation of the error is bounded by ε):

$$\Pr_{\hat{f} \sim \mathrm{tr}(f, D, n, \ell)} \left[ \mathbb{E}_{x \sim D}\left[ \ell(\hat{f}(x), f(x)) \right] \leq \epsilon \right] \geq 1 - \delta \tag{2}$$

Neural Network Sample Complexity Measures.
It is an open problem to determine the exact relationship between the number of samples n and the target error threshold ε in the PAC bound (Equation (2)) for neural networks on arbitrary target functions f. Rather than use the exact relationship, we use an upper bound on n as a function of ε, and minimize the induced upper bound. Arora et al. [2019] and Agarwala et al. [2021] present such an upper bound for learning analytic functions with neural networks. Agarwala et al. define a sample complexity measure C(f) ∈ R≥0, where higher values denote functions that require more samples to learn to a given error ε. With this sample complexity measure C(f), Equation (2) holds for all analytic f, D, ε, δ, and ℓ with:

$$n \geq \frac{c \left( C(f) + \log \tfrac{1}{\delta} \right)}{\epsilon^2} \tag{3}$$

where c is an unknown constant.
The tilde f̃ measures the magnitude of each coefficient of f's analytic representation; this is a measure of the influence of hard-to-model higher-order terms. We work with the following generalization of the tilde to multivariate analytic functions, where x⃗∥1 denotes concatenating a 1 to x⃗:

$$\tilde{f}(\vec{x} \| 1) = \sum_{\vec{k}} \left| a_{\vec{k}} \right| (\vec{x} \| 1)^{\vec{k}} \quad \text{where} \quad f(\vec{x} \| 1) = \sum_{\vec{k}} a_{\vec{k}} \, (\vec{x} \| 1)^{\vec{k}}$$

Agarwala et al. present the multivariate generalization; we contribute the novel generalization to x⃗∥1, which allows us to handle functions that are not analytic around 0, such as log.
With the definition of the tilde, we now present Agarwala et al.'s core theorem, which says that the tilde induces a sample complexity measure for analytic functions:

Theorem 1. For a sufficiently wide (see Arora et al. [2019, Theorem 5.1]) 2-layer neural network trained with gradient descent for sufficiently many steps (ibid.), if f is analytic, x⃗ is on the d-dimensional unit sphere, and ℓ is 1-Lipschitz, then f(x⃗∥1) is learnable in the sense of Equations (2) and (3) with:

$$C(f) \leq \tilde{f}'(1)$$

We present the proof of this theorem in Appendix B. The proof is a novel extension to inputs x⃗∥1 of Agarwala et al. [2021]'s proof.

Complexity-Guided Stratified Surrogate Dataset Selection.
Coming back to the stratified surrogate sample allocation problem (Equation (1)), our goal is to find per-stratum sample budgets n⃗ that minimize the expectation of error of the stratified surrogate. To help solve this optimization problem, we can refactor Equation (1) to separate out each stratum as follows:

$$\min_{\vec{n}} \; \sum_i \left( \int_{x \in X_i} D(x)\,\mathrm{d}x \right) \mathbb{E}_{\hat{f}_i \sim \mathrm{tr}(f_i, D(\cdot \mid X_i), n_i, \ell)} \left[ \mathbb{E}_{x \sim D(\cdot \mid X_i)} \left[ \ell(\hat{f}_i(x), f_i(x)) \right] \right] \quad \text{subject to} \quad \textstyle\sum_i n_i \leq n \tag{6}$$

This refactoring exploits the fact that a stratified learning algorithm learns the surrogate for each stratum independently: we can decompose the expected loss of the learning algorithm into the expectation over strata (the outermost expectation in Equation (6)) of the expectation over the outcomes of the surrogate learning algorithm on that stratum (the middle expectation) of the expected loss over the data distribution on that stratum (the inner expectation).

Predicted Error of a Surrogate.
Instead of optimizing Equation (6) directly, our approach is to optimize the conservative probabilistic upper bound given by the PAC framework for each surrogate. We define the predicted error ε̂(C, n, δ) of a stratified surrogate component to be the upper bound (with probability 1 − δ) of the error of the surrogate f̂_i against the function f_i. Concretely, the predicted error is the error ε for a given n and δ assuming that Equation (3) is tight with c = 1 (the value of c cancels out in our analysis, so this choice is just for notational convenience):

$$\hat{\epsilon}(C, n, \delta) = \sqrt{\frac{C + \log \tfrac{1}{\delta}}{n}}$$

We then replace the expectation of error in Equation (6) in each stratum with the predicted error of that stratum, resulting in the objective that our approach optimizes:

$$\min_{\vec{n}} \; \sum_i \left( \int_{x \in X_i} D(x)\,\mathrm{d}x \right) \hat{\epsilon}(C(f_i), n_i, \delta_i) \quad \text{subject to} \quad \textstyle\sum_i n_i \leq n \tag{8}$$

The objective of this problem is to find a vector of per-stratum sample budgets n⃗ that, in the expectation over strata (the outer expectation), minimizes the predicted error of the surrogate for each stratum, subject to a constraint that the total number of samples used is no more than n. Finally, we can solve this optimization problem. For a given stratified function f, sample budget n, and per-stratum failure probabilities δ_i:

Theorem 2. Equation (8) is minimized by sampling from each stratum in proportion to:

$$\left( \left( \int_{x \in X_i} D(x)\,\mathrm{d}x \right) \sqrt{C(f_i) + \log \tfrac{1}{\delta_i}} \right)^{2/3}$$

Theorem 2 defines how much data our complexity-guided sampling approach samples from each stratum. This term incorporates the frequency of that stratum (∫_{x∈X_i} D(x) dx), the complexity of that stratum (C(f_i)), and a term from the failure probability δ_i. We present the proof of Theorem 2 in Appendix A.
For convenience, throughout the rest of this paper we assume that all δ_i are set to be equal (∀i, j. δ_i = δ_j). Because each surrogate training is independent, this induces an overall PAC failure probability δ = 1 − (1 − δ_i)^k across the k strata.
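Putting the pieces together, the allocation procedure can be sketched as follows, under our reading that the predicted error of stratum i scales as sqrt((C_i + log(1/δ_i)) / n_i), with all per-stratum failure probabilities set equal:

```python
import math

def allocate(n, freqs, complexities, delta=0.1):
    """Split a total budget n across strata, assuming the predicted error
    of stratum i scales as sqrt((C_i + log(1/delta)) / n_i) and that all
    per-stratum failure probabilities equal delta."""
    w = [(p * math.sqrt(c + math.log(1.0 / delta))) ** (2.0 / 3.0)
         for p, c in zip(freqs, complexities)]
    total = sum(w)
    budgets = [n * x / total for x in w]
    # independent per-stratum trainings induce the overall failure probability
    overall_delta = 1.0 - (1.0 - delta) ** len(freqs)
    return budgets, overall_delta
```

Note that in practice the resulting budgets must be rounded to integers before training.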

Tightness of the Predicted Error Optimization.
We note that the optimal solution to Equation (8) is not necessarily the optimal solution to Equation (1). First, optimizing the predicted error is not the same as optimizing the expectation of error: there is a gap between the optimal solution to Equation (8) and the optimal solution to Equation (1). Assuming that the per-example loss is bounded by some value L, the expectation of error achieved by the optimal n⃗ for Equation (8) is bounded by:

$$\sum_i \left( \int_{x \in X_i} D(x)\,\mathrm{d}x \right) \left( (1 - \delta_i)\, \hat{\epsilon}(C(f_i), n_i, \delta_i) + \delta_i L \right)$$

Second, the bound on the predicted error itself may be loose. While the predicted error may be a loose bound on the error, our approach does not require exact values from these bounds; instead it compares the predicted errors of the different components of the stratified function to minimize the overall predicted error.

TURACO: PROGRAMS AS STRATIFIED FUNCTIONS
In this section we present Turaco, a programming language in which programs denote learnable stratified functions. We provide a program analysis for Turaco programs that calculates an upper bound on the complexity of each component of the stratified function that the program denotes.

Syntax and Standard Interpretation
Turaco supports analytic functions (e.g., sin, exp), including those which are analytic only on a subset of the reals (e.g., log). We restrict the supported operations to those required to implement the evaluation in Section 5.
Standard execution semantics. Figure 4 presents the big-step evaluation relation for expressions in Turaco. The expression relation ⟨σ, e⟩ ⇓ v says that under variable store σ (assigning values to all variables in e), the expression e evaluates to value v. These semantics are standard for IMP-like languages, with the exception of that for log{c}(x): note that the expression log{c}(x) takes an additional parameter c and requires |x − c| < c. We discuss this requirement in Section 4.2 and Appendix E.
The big-step evaluation relations for statements and full Turaco programs are standard and are presented in Appendix C.

Complexity Analysis
We now present a program analysis that gives an upper bound on the complexity of traces of Turaco programs (sequences of statements without if statements). The analysis uses two core concepts: a complexity interpretation of expressions to calculate an upper bound on the tilde of expressions (Section 3.2.2), and a standard dual-number execution [Griewank and Walther 2008; Wengert 1964] of the complexity interpretation to calculate the derivative of that upper bound, which, as we show below, is also an upper bound on the derivative of the tilde. The result of the dual-number execution allows us to upper bound the complexity of a trace of a Turaco program.

Program Analysis.
First we walk through the rules of the program analysis, presented as a big-step evaluation relation. Figure 5 presents the relation used to calculate the tilde for expressions in Turaco. The relation ⟨σ̃, e⟩ ⇓̃ (t̃, t̃′) says that under the variable complexity mapping σ̃ (mapping variables to tuples of their respective tilde and tilde derivative), the expression e has tilde ẽ ≤ t̃ and tilde derivative ẽ′ ≤ t̃′.
Broadly, we define the rules in Figure 5 using the definition of the tilde, the fact that the tilde is compositional (as we prove in Section 4.2.2), and the definition of a dual-number execution. For instance, the tilde of a constant c is the absolute value |c| of that constant, with a derivative of 0; the tilde of e1 + e2 is the sum of the tildes of the two expressions, with a derivative that is the sum of their derivatives.
The most complex rule is the rule for log{c}(x). To handle the fact that log(x) is not analytic around 0, log{c}(x) expands log(x) around x = c. The value of c is a nuisance parameter that must be set to allow the expansion around x = c to converge for all inputs (inducing the |x − c| < c requirement in Figure 4) while minimizing the overall program complexity. Note that the condition c > t̃√2 + 1 can always be satisfied by applying the identity log(x) = log(c) + log(x/c).
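The dual-number bookkeeping can be sketched as follows. The pairing of a tilde bound with its derivative bound, evaluated at 1, follows the analysis; the specific rule set shown (constants, addition, multiplication, sin, exp) is our illustrative rendering, not a transcription of Figure 5:

```python
import math

# Each analysis value is a pair (t, dt): an upper bound on the tilde and
# on its derivative, both evaluated at 1.
def const(c):  return (abs(c), 0.0)
def var():     return (1.0, 1.0)              # a program input
def add(a, b): return (a[0] + b[0], a[1] + b[1])
def mul(a, b): return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])
def sin(a):    # the coefficient magnitudes of sin are those of sinh
    return (math.sinh(a[0]), math.cosh(a[0]) * a[1])
def exp(a):
    return (math.exp(a[0]), math.exp(a[0]) * a[1])

linear    = var()              # tilde derivative 1.0
quadratic = mul(var(), var())  # tilde derivative 2.0: costlier to learn
```

Consistent with the example in Section 2, a quadratic expression gets a larger complexity bound than a linear one.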

Tilde Calculus
This section presents the core lemma stating that the upper bound on the tilde is compositional, and that the derivative of the upper bound is an upper bound on the derivative. The bounds on the tilde are from Agarwala et al. [2021]; we extend these bounds to also bound the derivative of the tilde.
Lemma 4.1. The tilde and its derivative have upper bounds that are compositional with respect to function composition:

$$\widetilde{f \circ g}(x) \leq \tilde{f}(\tilde{g}(x)) \qquad \left(\widetilde{f \circ g}\right)'(x) \leq \tilde{f}'(\tilde{g}(x)) \cdot \tilde{g}'(x)$$

The proof of this lemma is presented in Appendix B.

Soundness
This section proves that the Turaco complexity analysis is sound: that it computes an upper bound on the true complexity of learning a trace. We prove this by induction on expressions and traces. Our approach is based on the observation that at a given program point, the value of each variable x was computed by some function f_x applied to the input. We use the notation {f_x} as shorthand for {f_x | x ∈ σ}, a set of functions indexed by the variables x. Our inductive hypothesis requires that σ̃ contain the tilde and tilde derivative of each of these functions (evaluated at 1, as in Theorem 1). We use the notation σ̃ ⊢ {f_x} to denote the predicate that each f_x has tilde and tilde derivative bounded by σ̃:

$$\tilde{\sigma} \vdash \{f_x\} \;\triangleq\; \forall x.\; \tilde{f}_x(1) \leq \tilde{\sigma}(x)_1 \,\wedge\, \tilde{f}'_x(1) \leq \tilde{\sigma}(x)_2$$

We also note that the standard execution semantics big-step relation ⇓, both for expressions and for traces, is a function. We write ⟦e⟧ to refer to that function for expressions and ⟦t⟧_x to refer to the function for traces followed by taking the value of the variable x. We use the notation ∘ to denote function composition. The functions denoted by expressions and traces have multiple inputs; in this context, composition with a set of functions {f_x} supplies each input named x with the result of f_x. Now we state the core theorems of correctness for the Turaco analysis:
Lemma 4.2. The tilde big-step relation for expressions upper bounds the tilde and tilde derivative: if σ̃ ⊢ {f_x} and ⟨σ̃, e⟩ ⇓̃ (t̃, t̃′), then the tilde of ⟦e⟧ ∘ {f_x} at 1 is at most t̃ and its derivative at 1 is at most t̃′.
Lemma 4.3. The tilde big-step relation for traces upper bounds the tilde and tilde derivative: if σ̃ ⊢ {f_x} and the trace relation takes σ̃ to σ̃′ on trace t, then σ̃′ ⊢ {⟦t⟧_x ∘ {f_y}}.
Theorem 3. The complexity relation computes an upper bound on the true complexity: the complexity the analysis computes for a trace t and output variable x upper bounds C(⟦t⟧_x).
The proofs of Lemmas 4.2 and 4.3 and Theorem 3 are presented in Appendix D.

Extensions
Our implementation extends Turaco to support vector-valued variables, applying all operations elementwise. Following Agarwala et al. [2021], we define the complexity of learning a vector-valued function to be the sum of the complexity of learning each output component. Our implementation also supports other syntactic sugar including a minus operation and division by constants.
Loops. Our implementation of Turaco also supports fixed- and bounded-length loops, though they are not required for any case study in the paper (thus we do not present them here). However, unbounded loops pose a challenge because our approach trains a distinct surrogate per path, which is not possible with unbounded loops. This restriction to statically bounded loops is a common feature of analyses that reason about numerical approximation, including reliability analyses [Boston et al. 2015; Carbin et al. 2013; Misailovic et al. 2014] and floating-point error analyses [Darulova and Kuncak 2014; Magron et al. 2017; Solovyev et al. 2018]. Reasoning about loops with dynamic, input-dependent bounds requires separate techniques (e.g., Boston et al. [2015]).

EVALUATION
In this section we evaluate our complexity-guided sampling approach, using Turaco's complexity analysis to determine sampling rates for a range of benchmark programs. We demonstrate that complexity-guided sampling consistently results in more accurate surrogates than those trained using baseline distributions (the frequency distribution of paths and the uniform distribution of paths). We also demonstrate that such an improvement in surrogate error can result in an improvement in execution speed in an application with a maximum error threshold. In Section 5.1 we first evaluate across both a set of real-world programs, showing expected error improvements in practice, and a set of synthetic programs, showing cases where the complexity-guided sampling approach shines and cases where it fails. Then in Section 5.2 we dive into a case study on a specific large-scale program, a demonstration 3D renderer [Lettier 2019] of the kind that forms the core of a graphics rendering pipeline for a movie or 3D game engine [Christensen et al. 2018; Tatarchuk 2006].

Evaluation Across Programs
In this section we evaluate our complexity-guided sampling approach using Turaco's complexity analysis to determine sampling rates for a range of benchmark programs. We evaluate both a set of real-world programs, showing expected error improvements in practice, and also a set of synthetic programs, showing cases where the complexity-guided sampling approach shines and cases where it fails.
Methodology. Following the input scale assumptions from Agarwala et al. [2021], we sample each input variable uniformly from [−1, 1] or [0, 1], as appropriate for the program, inserting scale factors as appropriate given the expected data distribution of the original program. Uniformly sampling inputs from these ranges induces both a data distribution over inputs and a path frequency distribution.
For all benchmarks other than the Jmeint benchmark, we evaluate using training data budgets of 10 points logarithmically spaced between 10 and 1000. For the Jmeint benchmark, which is more data intensive, we evaluate using training data budgets of 10 points logarithmically spaced between 1000 and 10000. When computing the complexity-guided sampling distribution, we use δ = 0.1.
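To make the allocation concrete, the following sketch computes a complexity-guided sampling distribution, assuming (as in Theorem 2) that each path's share of the budget is proportional to (p_i² · (C_i + log(1/δ_i)))^(1/3), with δ split evenly across paths; the function name and signature are our own illustrative choices.

```python
import math

def sampling_distribution(complexities, frequencies, delta=0.1):
    """Complexity-guided sampling shares for each path (a sketch).

    complexities: per-path complexity C_i from the Turaco analysis.
    frequencies: per-path input-distribution frequency p_i.
    delta: overall failure probability, split evenly across paths.
    """
    k = len(complexities)
    delta_i = delta / k  # even split; see the closed-form-solution assumption
    weights = [
        (p * p * (c + math.log(1.0 / delta_i))) ** (1.0 / 3.0)
        for c, p in zip(complexities, frequencies)
    ]
    total = sum(weights)
    return [w / total for w in weights]
```

For the two-path example in Section 5.1.2 (complexities 137677 and 57, each at 50% frequency), this reproduces the reported 93%/7% split.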
For each path in each benchmark, we train a 1-hidden-layer MLP with 1024 hidden units and a ReLU activation, using 10,000 steps of Adam with learning rate 0.0005 and batch size 128. We run the training for 5 trials. We report both the predicted error improvement (Equation (7)) and the empirical improvement, the geometric mean improvement in error across trials. As in Section 5.2, improvement is defined as the mean percentage error between the predicted error for complexity-guided sampling and that for the baseline sampling method.
5.1.1 Results. Table 1 presents the results of the evaluation across 6 benchmark programs: Luminance, Huber, BlackScholes, Camera, EQuake, and Jmeint. Table 2 presents path and distribution statistics for each benchmark program.
We find that across this selection of programs, from predicted error improvements of 3.33% against the frequency sampling baseline, complexity-guided sampling results in an empirical improvement of 4.81%; from predicted error improvements of 4.33% against the uniform sampling baseline, complexity-guided sampling results in an empirical improvement of 5.43%. Such a magnitude of decrease in error can significantly affect a system end-to-end. For example, consider Table 3, which shows the results of a hyperparameter search to choose the fastest-to-execute neural network that meets an error threshold of 10% (this table is a replication of Table 4 in Appendix A of Renda et al. [2021], with an added column of "Error -5%"). Renda et al. chose the network with size 64, which has a 1.57× speedup; however, a decrease in error of 5% would result instead in choosing the network with size 32, which has a 2.01× speedup, a 28% improvement in application performance.
Luminance. Against the frequency baseline, compared to a predicted error improvement of 2.58%, complexity-guided sampling results in an improvement of 15.01%. Against the uniform baseline, compared to a predicted error improvement of 6.97%, complexity-guided sampling results in an improvement of 15.17%.
Huber. Against the frequency baseline, compared to a predicted error improvement of 0.49%, complexity-guided sampling results in an improvement of 8.15%. Against the uniform baseline, compared to a predicted error improvement of 1.93%, complexity-guided sampling results in an improvement of 9.54%.
BlackScholes. Figure 10b presents the BlackScholes benchmark, which performs a part of the Black-Scholes option pricing model (with otype positive for puts and negative for calls), for inputs uniform in [0, 1] (other than otype, which is uniform in [−1, 1]). This benchmark is a fragment of the Black-Scholes benchmark in the AxBench benchmark suite [Yazdanbakhsh et al. 2017]. This benchmark has 2 paths: when otype > 0 (path l with complexity 165.72; for puts) and when otype < 0 (path r with complexity 485.23; for calls).
Against the frequency baseline, compared to a predicted error improvement of 4.43%, complexity-guided sampling results in an improvement of 3.61%. Against the uniform baseline, compared to a predicted error improvement of 1.30%, complexity-guided sampling results in an improvement of 4.00%.
Camera. Figure 18 in Appendix F presents the Camera benchmark, which performs a part of the conversion from blackbody radiator color temperature to the CIE 1931 x,y chromaticity approximation function, for inputs x ∈ [−1, 1], y ∈ [−1, 1], invKiloK ∈ [0, 1], and T ∈ [0.1, 0.5] (note that T is used exclusively to determine the path). This benchmark is included in the Frankencamera platform [Adams et al. 2010], and is based on an implementation by Kang et al. [2002]. This benchmark has three paths: when T < 0.2222 (path ll with complexity 0.86), when 0.2222 < T < 0.4 (path lrl with complexity 0.81), and when 0.4 < T (path rrr with complexity 9.53).
Against the frequency baseline, compared to a predicted error improvement of 2.83%, complexity-guided sampling results in an improvement of 0.56%. Against the uniform baseline, compared to a predicted error improvement of 0.22%, complexity-guided sampling results in an improvement of 1.36%.
EQuake. Figure 19 in Appendix F presents the EQuake benchmark, which computes the displacement of an object after one timestep in an earthquake simulation. This benchmark is a fragment of the 183.equake benchmark in the SPECfp2000 benchmark suite [Henning 2000]. This benchmark has 2 paths: when t > 0.5 (path l with complexity 56.29) and when t < 0.5 (path r with complexity 1169.50).
Against both the frequency and uniform baselines, compared to a predicted error improvement of 7.45%, complexity-guided sampling results in an improvement of 2.25%.
Jmeint. Figure 20 in Appendix F presents the Jmeint benchmark, which calculates whether two 3D triangles intersect, and several auxiliary variables related to their intersection. All inputs are sampled from [−1, 1]. This benchmark is a fragment of the Jmeint benchmark in the AxBench benchmark suite [Yazdanbakhsh et al. 2017]. This benchmark has 18 paths; each path has the same complexity of 72361000, but with different frequencies.
Against the frequency baseline, compared to a predicted error improvement of 2.34%, complexity-guided sampling results in an improvement of 0.01%, a negligible change in error. Against the uniform baseline, compared to a predicted error improvement of 8.44%, complexity-guided sampling results in an improvement of 1.02%.
We note that this benchmark has the highest complexity of any evaluated program (requiring more samples and still resulting in higher overall errors), and also that empirically some paths do appear to be significantly easier to learn despite the matching complexities.
5.1.2 Analysis: Complexity-Guided Sampling Successes. In this section, we demonstrate examples of where the complexity-guided sampling technique results in significantly better error than baselines.
Complex paths. The first case is when some paths are significantly more complex than others: neither the frequency nor the uniform baseline takes into account path complexity, so we expect both baselines to undersample the complex path. Figure 11a presents an example of such a case. In this example, the complexity of the x < 0.5 path (l) is 137677, while the complexity of the x ≥ 0.5 path (r) is 57. The frequency of each of the l and r paths is 50%. The complexity-guided sampling approach samples the l path with probability 93% and the r path with probability 7%.
Against both the frequency and uniform baselines, compared to a predicted error improvement of 22.72%, complexity-guided sampling results in an improvement of 10.9%.
Skewed frequency distribution. The second case is when some paths are significantly more frequent than others. This confers advantages over the uniform baseline, which does not take into account path frequency, and also over the frequency baseline, which does not take into account the functional form of the learning bound in Equation (3) (i.e., that error decreases in proportion to the inverse square root of the number of samples). Figure 11b presents an example of such a case. In this example, all paths have a complexity of 14, while the frequencies are either 10% (for paths l, rl, and rrl) or 70% (for path rrr). The complexity-guided sampling approach samples each of the 10%-frequency paths with probability 15%, and the 70%-frequency path with probability 55%.
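As a self-contained worked check of this example, assuming budget shares proportional to (p_i² · (C_i + log(1/δ_i)))^(1/3) with δ = 0.1 split evenly across the four paths (our reading of the Theorem 2 allocation), the reported 15% and 55% sampling probabilities are reproduced:

```python
import math

# All paths share complexity 14; frequencies are 10%, 10%, 10%, and 70%.
complexities = [14.0, 14.0, 14.0, 14.0]
frequencies = [0.1, 0.1, 0.1, 0.7]
delta_i = 0.1 / len(complexities)  # even split of the failure probability

weights = [
    (p * p * (c + math.log(1.0 / delta_i))) ** (1.0 / 3.0)
    for c, p in zip(complexities, frequencies)
]
shares = [w / sum(weights) for w in weights]
# shares is approximately [0.15, 0.15, 0.15, 0.55]
```

With equal complexities the allocation reduces to weighting each path by p_i^(2/3), which is why the 70%-frequency path receives less than 70% of the budget.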
Against the frequency baseline, compared to a predicted error improvement of 3.75%, complexity-guided sampling results in an improvement of 28.38%. Against the uniform baseline, compared to a predicted error improvement of 14.08%, complexity-guided sampling results in an improvement of 20.76%.
Note that this type of path distribution (with rare paths below 1% of the input data distribution) matches the distribution of paths in the renderer evaluation in Section 5.2, and results in a similar significant overperformance of the predicted error improvement.

5.1.3 Analysis: Complexity-Guided Sampling Failures. In this section, we demonstrate examples of where the complexity-guided sampling technique results in worse error than baselines.

Complexity imprecision. The first case is when the complexity is too loose an upper bound on the resulting error of a surrogate of that function. In this case, the complexity-guided sampling approach can oversample from the corresponding path. Figure 11c presents an example of such a case. In this example, the complexity of the l path is 18638 and the complexity of the r path is 16. Though we are not aware of any tighter bounds on the complexity of learning sin(4x) for x ∈ [0.5, 1], in practice we find that neural networks are able to learn this function to low error with relatively few samples.
The frequency of each path is 50%. The complexity-guided sampling approach samples the l path with probability 90.9% and the r path with probability 9.1%. Against both the frequency and uniform baselines, compared to a predicted error improvement of 20.88%, complexity-guided sampling results in a change of −92.59%, a significant increase in error.
Nonuniform complexity. The second case is when the complexity of the function being learned varies significantly across different scales. In this case, the complexity-guided sampling approach can oversample from the corresponding path.

fun(x, y) {
    if (x < 0.5) {
        y = y + (y * 100);
        y = y - (y / 101) * 100;
        y = y * 2;
    } else {
        y = sin(2 * y);
    }
    return y;
}

(e) Failure: synthetic example with a function on which the Turaco analysis is imprecise.

Fig. 11. Examples of complexity-guided sampling successes and failures.

Figure 11d presents an example of such a case. In this example, the complexity of the l path is 12129 and the complexity of the r path is 2.38. This causes an issue for complexity-guided sampling because with a large target error (e.g., ε > 0.01), the l path is essentially zero (and therefore should have low complexity), while with a small target error (e.g., ε < 0.00001), the l path is very complex. Because the sample complexity bounds themselves are scale-independent upper bounds, they do not by default incorporate this knowledge.
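For concreteness, here is a direct Python transcription of the Figure 11e program (a sketch; sin comes from the math module). On the x < 0.5 path the three updates algebraically cancel to y * 2 (y becomes 101y, then 101y − (101y/101)·100 = y, then 2y), which is why algebraic simplification would drastically reduce that path's calculated complexity.

```python
import math

def fun(x, y):
    # Path l: three updates that algebraically simplify to y * 2.
    if x < 0.5:
        y = y + (y * 100)
        y = y - (y / 101) * 100
        y = y * 2
    # Path r: a simple trigonometric function.
    else:
        y = math.sin(2 * y)
    return y
```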
The frequency of each path in this example is 50%. The complexity-guided sampling approach samples the l path with probability 92.9% and the r path with probability 7.1%. Against both the frequency and the uniform baselines, compared to a predicted error improvement of 22.7%, complexity-guided sampling results in a change of −351%, a 3.5× increase in error.
Analysis imprecision. The third case is when Turaco's analysis of the complexity is imprecise: Theorem 3 proves that Turaco's complexity analysis computes an upper bound on the tilde, but this upper bound may be loose (as discussed in Section 4.4). In this case, the complexity-guided sampling approach can oversample from the corresponding path. Figure 11e presents an example of such a case. In this example, the calculated complexity of the l path is 161604 and the calculated complexity of the r path is 56.6. If we were to perform algebraic simplification (which Turaco does not), we would find the complexity of the l path to instead be 4. The frequency of each path is 50%. With Turaco's computed complexities, the complexity-guided sampling approach samples the l path with probability 93.3% and the r path with probability 6.7%. Against both the frequency and the uniform baselines, compared to a predicted error improvement of 23.03%, complexity-guided sampling results in a change of −491.22%, a 5× increase in error.

Renderer Demonstration
In this section we present a case study of our complexity-guided sampling and complexity analysis. The program under study is a demonstration 3D renderer [Lettier 2019] of the kind that forms the core of a graphics rendering pipeline for a movie or 3D game engine [Christensen et al. 2018; Tatarchuk 2006]. Figures 12a and 12b show scenes that the renderer generates. We demonstrate that the sampling and analysis techniques in Sections 3 and 4 consistently result in more accurate surrogates than those trained using baseline distributions (the frequency distribution of paths and the uniform distribution). Compared to training surrogates on the frequency distribution of paths, complexity-guided sampling decreases error by 17%. Compared to training on the uniform distribution of paths, complexity-guided sampling decreases error by 44%. These improvements in error correspond to perceptual improvements in the generated images, as shown in Figures 12c to 12e. The surrogate targets a section of one core shader, totaling 60 lines of code. Figure 21 in Appendix G presents the code for the renderer case study.
Input-output specification. This program is a shader which assigns colors to pixels in the image based on the scene geometry, materials, lights, and other properties. The program is called for each pixel that is rendered in the image. Each invocation of the program takes as input a set of 11 fixed-size vectors, totaling 35 inputs. The program returns as output a set of 4 fixed-size vectors, totaling 8 outputs. These outputs represent two RGBA colors, the first representing the base color of the pixel, and the second representing the color and intensity of a specular map at that pixel.
Scenes and datasets. We evaluate the renderer on four different scenes, which we combine into nine different datasets. Figures 12a and 12b present two of the four different scenes under consideration; the four scenes are all combinations of views from the front and top, during the day and night. We combine these scenes into nine datasets: a dataset with each scene, a dataset combining each scene from each angle (front day and front night, top day and top night), a dataset combining each scene from each time of day, and a dataset combining all scenes. Figure 22 in Appendix G presents the full set of scenes under study.
Paths. The program has 48 syntactically possible paths, 9 of which are exercised by the renderer. The top part of Table 4 presents statistics about the paths under study, showing the identifier (a trace of l and r characters denoting which branch of each if statement the path takes), the lines of code in the corresponding trace, and the complexity of the corresponding trace according to the analysis in Section 4.2. The paths are broken up into paths for rendering smoke particles from the chimney, water particles in the river, and the solids of the ground and house. Each set of paths is duplicated for twilight, nighttime, and daytime. Within each time of day, the smoke paths are the least complex, followed by water, then solids. Across different times, twilight paths are the least complex, followed by nighttime, then daytime. Figure 13 shows side-by-side comparisons of the three classes of paths: water, smoke, and solids. In each of these images, one path returns red for all pixels while the other paths return black for all pixels. The base scene is the front daytime scene in Figure 12a. Table 4 also presents the observed distribution and the complexity-guided distribution of paths for each dataset. In general, the twilight paths are rarer than the nighttime paths, which are rarer than the daytime paths: this is because data collection for the nighttime scenes extends through twilight and into the morning. For all datasets, the smoke paths are rarer than the water paths, which are in turn rarer than the solids paths; this is purely due to the scene geometry.

Surrogate Training and Deployment Methodology.
To create and deploy a surrogate of the renderer, we train a surrogate of each path, then create a stratified surrogate which branches on the set of path conditions and applies the corresponding surrogate.
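The dispatch logic of a stratified surrogate can be sketched as follows; the path predicates and per-path models here are hypothetical stand-ins for illustration, not the renderer's actual nine paths.

```python
# Sketch of a stratified surrogate: branch on the path conditions,
# then apply the surrogate trained for the matching path.

def make_stratified_surrogate(paths):
    """paths: list of (predicate, surrogate) pairs covering the input space."""
    def surrogate(x):
        for predicate, model in paths:
            if predicate(x):
                return model(x)
        raise ValueError("input matches no path condition")
    return surrogate

# Example with two hypothetical paths split on x < 0.5:
stratified = make_stratified_surrogate([
    (lambda x: x < 0.5, lambda x: 2 * x),    # surrogate trained for path l
    (lambda x: x >= 0.5, lambda x: x * x),   # surrogate trained for path r
])
```

The evaluation-time branch mirrors the program's own path conditions, which is why determining the stratum of an input must itself be tractable (see the discussion of assumptions later in the paper).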
Our goal is to compare the errors achieved by training on the complexity-guided distribution of paths against those of baseline distributions of paths. We compare the approaches across different training datasets, different total numbers of training data points, and evaluating across different evaluation sets, all with multiple trials.
For the design of each surrogate, we use a simple MLP architecture with a single hidden layer of 512 units and a ReLU activation. This architecture matches that of Agarwala et al. [2021], except using 512 rather than 1000 hidden units (we found that the accuracy of each path surrogate plateaued by 512 hidden units). We train the surrogate using the Adam optimizer with a learning rate of 0.0001 and a batch size of 256 for 50,000 steps. We run 5 trials of all experiments, and report the error as an arithmetic mean when reported in isolation for a given surrogate (e.g., as in Figure 15) and as a geometric mean when comparing relative error rates across different settings (e.g., as in the headline error improvement numbers in Table 5).

Table 5 presents the geometric mean decrease in error of using complexity-guided path sampling compared to each baseline, on each dataset. Across most datasets, complexity-guided path sampling results in lower error than both frequency-based path sampling and uniform path sampling. On datasets with few paths (Front Day) and in which all paths are well represented (minimum 5% frequency), the gap is minimal, and frequency-based path sampling matches or outperforms complexity-guided path sampling. On datasets with more and rarer paths (e.g., Top Night), the gap widens and complexity-guided path sampling outperforms frequency-based path sampling; we discuss this phenomenon in Section 5.1.2. On all datasets, complexity-guided path sampling outperforms uniform path sampling.

Figure 14 presents the correlation between the predicted error (Equation (7)) and the observed empirical error for each dataset, showing a strong correlation. The left plot shows this correlation for surrogates trained with frequency-based path sampling, and the right plot shows this correlation for surrogates trained with uniform path sampling.
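The aggregate statistics used in this reporting can be sketched as follows (a minimal illustration; the function names are ours).

```python
import math

def geometric_mean(values):
    # Geometric mean of positive error values (used when comparing
    # relative error rates across different settings).
    return math.exp(sum(math.log(v) for v in values) / len(values))

def percent_improvement(baseline_err, cg_err):
    # Percentage decrease in error of complexity-guided sampling relative
    # to a baseline; positive values indicate an improvement.
    return 100.0 * (baseline_err - cg_err) / baseline_err
```

A geometric mean is the natural aggregate here because it averages multiplicative ratios: halving error on one dataset and doubling it on another cancel out.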
The x axis is the decrease in predicted error (specifically, the mean percentage error between the predicted error for complexity-guided sampling and the sampling method in the plot), and the y axis is the decrease in empirical error (the mean percentage error between the error observed from complexity-guided sampling and the sampling method in the plot). Each point represents a different dataset (e.g., front-day, top-night, etc.). The red dotted line shows the line of best fit. For the frequency-based surrogates, the Pearson correlation is r = 0.86 and the Spearman correlation is ρ = 0.97. For the uniform surrogates, the Pearson correlation is r = 0.82 and the Spearman correlation is ρ = 0.87.

Figure 15 presents the error of surrogates on each dataset. Each plot shows the error for a different dataset. Each plot has three different lines, respectively showing the error of each surrogate training distribution (complexity-guided, frequency-based, and uniform). Each x axis is the total training data budget. Each y axis is the error of the resulting stratified surrogate.

Surrogate Errors.
For a given dataset budget, our complexity-guided sampling approach results in lower error than baseline sampling approaches. Generally, increasing this dataset budget also results in lower error for all approaches. These two approaches to decreasing surrogate error (better sampling techniques and sampling more data) are not in conflict with each other.
In Figure 15, the sampling approaches converge in error with large dataset budgets. This convergence is due to our evaluation methodology: following the convention of the prior work which established these bounds [Agarwala et al. 2021; Arora et al. 2019], we use a fixed width for all neural networks, resulting in neural networks that saturate in error with large datasets. An alternative methodology would be to grow the width of the neural network along with the size of the dataset, requiring a full hyperparameter search at each network scale. With such a methodology, the errors would not plateau in the way that they do in Figure 15.

Figure 12 presents the renderings generated by the surrogates for the Front Day and the Top Night scenes. The budgets used for these renderings are the smallest that lead to a validation error of less than 2%, which was qualitatively chosen as a threshold around which surrogate renders converge on the ground truth (i.e., the rendered scenes visually approach the quality of the original scene).

Visualization.
The top row shows the Front Day scene using surrogates trained on the Front Day dataset. In this scene, the complexity-guided and frequency-based surrogates result in similarly accurate renders, with the primary difference being that the frequency-based surrogate rendering has slightly darker green shadows on the front of the house. This similarity is expected given the similar errors observed in Table 5. Uniform sampling results in an inaccurate render, as expected given its high error.
The bottom row shows the Top Night scene using surrogates trained on the dataset combining all scenes. In this scene the complexity-guided surrogate has the most accurate render, as expected given the errors observed in Table 5. The frequency-trained surrogate colors everything a much darker purple; the uniform-trained surrogate, in contrast, colors everything much more tan.
In sum, the error improvements in Table 5 correspond with improvements in the rendered images.

RELATED WORK
In this section we survey related work for each contribution.
Optimal stratified sampling. Optimal stratified sampling is a classic area in statistics [Thompson 2012]. Most work in this domain focuses on optimal parameter estimation, and uses stratified sampling to reduce the variance of estimates by ensuring sufficient independent samples are taken from each subpopulation. Our approach is novel in the assumptions we make for training stratified surrogates of programs, and in the specific sample complexity bounds we base our results on. Santner et al. [2018] survey sampling techniques for computer experiments; Chapter 5.2.3 discusses stratified random sampling in particular, showing optimality criteria for sampling for unbiased estimators. These approaches are generic techniques for minimizing the variance of estimators: they do not specifically consider training a neural network, nor do they consider the differing complexity of different strata. Cortes et al. [2019] present an active learning approach for learning in the regime where the input space is partitioned into separate regions (strata, in our terminology) and a separate hypothesis (surrogate) is trained for each, and derive a similar allocation of data points. This approach has several differences from ours. First, it assumes a different form for sample complexity and derives correspondingly different sampling bounds. The definition of complexity (C in our formalism) that Cortes et al. use is a function of the number of hypotheses in the hypothesis class, the total number of data points used, and the number of data points for a given stratum that have been queried thus far; it is not a function of any complexity metric of the function being learned.
More concretely, Cortes et al.'s approach assumes a small, finite hypothesis class (the set of possible outputs of the training algorithm) of binary classifiers, and has runtime proportional to the size of the hypothesis class, requires samples proportional to the log of the size of the hypothesis class, and bounds the error of the result as a function of the log of the size of the hypothesis class. In their evaluation, Cortes et al. use a hypothesis class of a set of 3000 random hyperplanes. However, this approach is not tractable when using a neural network as the hypothesis class: a neural network with 43,000 32-bit floating point weights (as in the case study in Section 5.2) induces a hypothesis class of size 10^414,217. This results in intractable runtime and large or meaningless bounds. Beyond these distinctions, Cortes et al.'s approach is also an active learning approach that determines whether or not to query a label of a given data point for an input stream of data points, whereas our approach operates offline. Cortes et al.'s approach is thus a better fit when learning stratified functions of unknown complexity (e.g., non-analytic functions) using a finitely sized hypothesis class (not a neural network), and is targeted at the online setting when given a sampler of the overall data distribution but not one for each stratum. Our approach is a better fit when learning a priori known stratified analytic functions with neural networks, and is targeted at the offline setting when given a sampler for each stratum.
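As a quick sanity check of this arithmetic (the 43,000-weight count is taken from the Section 5.2 case study; the computation itself is ours): a network with 43,000 32-bit weights has 2^(43,000 × 32) distinct bit patterns, and the base-10 exponent of that count is about 414,217.

```python
import math

# Base-10 magnitude of the hypothesis class induced by enumerating every
# bit pattern of 43,000 32-bit floating point weights.
bits = 43_000 * 32
exponent = math.floor(bits * math.log10(2))  # 2^bits is about 10^exponent
```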
Sample complexity program analysis. Program analysis is a broad set of techniques to determine properties of programs [Cousot and Cousot 1977;Nielson et al. 1999]. Our analysis in Section 4.2 is a novel nonstandard interpretation calculating the tilde, combined with a standard implementation of forward-mode automatic differentiation [Griewank and Walther 2008;Wengert 1964] and a standard symbolic execution which executes all paths in the program [Cadar et al. 2008;King 1976]. Bao et al. [2012] present a program analysis that decomposes programs into continuous regions, with the goal of characterizing the sensitivity of each continuous region to input noise. This analysis computes a different notion of complexity than ours, and does not represent the sample complexity of learning a surrogate of each region. Hoffmann and Hofmann [2010] present a program analysis that calculates the algorithmic complexity of a program. This complexity again does not lead to bounds on the sample complexity of learning a surrogate of the program.

ASSUMPTIONS AND LIMITATIONS
Our contributions in Sections 3 and 4 make assumptions about the programs under study, the functions that those programs denote, and the surrogate training algorithms. Here we document these assumptions and note possible failure modes for our techniques.
Assumptions imported from prior work. Our sample complexity results are subject to all assumptions from the prior work that gives the sample complexity bounds for neural networks that we use [Agarwala et al. 2021]. These sample complexity bounds only apply to analytic functions. They further assume that inputs come from the unit sphere; this does not match many practical applications, including those in Section 5. Finally, these sample complexity bounds assume that the neural network under study is a 2-layer, sufficiently wide neural network trained with SGD with an infinitely small step size, using a 1-Lipschitz loss function.
Despite these assumptions, Agarwala et al. [2021, Appendix B.2] empirically verify that the sample complexity bounds hold. We also show in Sections 2 and 5.2 that the theoretical sample complexity bounds correlate with empirical sample complexity results.
Complexity-guided sampling. The first assumption is that we know the distribution of inputs p(x) ahead of time, both in terms of the distribution of strata p(X_i) = ∫_{x∈X_i} p(x) dx and the distribution of inputs within a given stratum p(x | X_i). The second assumption is that optimizing the upper bound of the per-stratum loss results in a reasonable optimum for the combined surrogate. The third is the assumption that ∀i, j. δ_i = δ_j, which we make to ensure a closed-form solution; this is not guaranteed to be optimal.
Convergence in the limit of strata. In the limit of infinitely many strata (k → ∞), the complexity-guided sampling approach induced by Theorem 2 converges to sampling each stratum with probability proportional to (∫_{x∈X_i} p(x) dx)^(2/3) (see Note A.1). In the limit, this distribution does not account for complexity. However, in practice the complexity still guides sampling. Section 5.2 evaluates an example with all complexities C_i ≥ 5899 and δ = 0.01. For the log(δ_i⁻¹) term to match the contribution of the complexity term there would need to be ≈ 10^2600 strata; this example has only 9. Thus while in the infinite limit of strata our approach is complexity-agnostic, in practice it is dominated by the complexity. We note that this property necessarily occurs with any underlying PAC-style bound with a term that sums complexity and log(1/δ) (e.g., those of Vapnik and Chervonenkis [1971] and Valiant [1984]): almost surely, paths are sampled with probability that does not depend on their complexity (see Note A.2).
Program analysis. The main assumption here is that Agarwala et al. [2021]'s algebra on tilde functions results in a sufficiently precise upper bound on the tilde. This is not always the case, as discussed in Section 4.4. We also note that Turaco's program analysis could be made more precise in multiple well-known ways. For instance, we could perform constant propagation, algebraic simplification, and automatic inference of constraints (which would be useful for log expressions). We have excluded such extensions for the sake of simplicity of presentation.
Analysis compute cost. For a given path, computing the tilde and its derivative has essentially the same cost as executing the path twice. Thus in the most pessimistic case this would allocate 2× more samples per path to the baseline approaches. However, this pessimistic case assumes that sampling a program input is free, which it may not be: for example, the renderer case study in Section 5.2 requires executing the video game engine (including running physics simulations) to get a program input for the shader of which we learn a surrogate.
Stratified surrogates. We provide sample complexity bounds for constructing stratified surrogates, assuming that for a given program every path is a different function. This assumes that it is tractable to compute which stratum a given input resides in before applying the surrogate; this evaluation-time stratum check must not preclude the use of the surrogate for its downstream task. We therefore adopt a standard modeling assumption in the approximate computing literature: that precisely determining paths is an acceptable cost during approximate program execution [Carbin et al. 2013; Sampson et al. 2011]. It also assumes that there are a tractable number of paths, which excludes programs with a large number of if statements or loops. The assumption that there are a tractable number of paths is common among techniques like concolic testing [Cadar et al. 2008; King 1976; Sen et al. 2005]. Similar to prior work, we find that in practice the programs we evaluate only use a fraction of the syntactically possible paths (e.g., the Jmeint benchmark in Section 5 uses 18 out of 1728 possible paths).
Empirical evaluation limitations. We note two limitations of our empirical evaluations. The first is that some evaluations are in the ultra-low-data regime, where rounding to an integer number of data samples affects the accuracy. The second is that one parameter is set to an arbitrary value.

CONCLUSION
We present an approach to allocating samples among strata to train stratified neural network surrogates of stratified functions. We also present a programming language, Turaco, in which all programs are learnable stratified functions and a program analysis to determine the complexity of learning surrogates of those programs. Our results take a step towards a cohesive, end-to-end methodology for programming using surrogates of programs.

A COMPLEXITY-GUIDED SAMPLING
This appendix provides the proof for Theorem 2 in Section 3.
For convenience we duplicate the definitions of learnability. A learnable function is one satisfying the condition in Equation (2). Following Arora et al. [2019] and Agarwala et al. [2021], we study functions and learning algorithms for which the relationship between the sample count n, error ε, and failure probability δ in Equation (2) is the bound in Equation (3). We define a stratified function as in Section 3: the strata {X_i} are pairwise disjoint (∀i ≠ j. X_i ∩ X_j = ∅) and cover the input space (∪_i X_i = X).
We define a stratified surrogate f̂ as a stratified function with components f̂_i. For a data distribution D, let D(x) be the probability that x is sampled from D, and let D(X_i) be the total probability mass of all data points within X_i over D (i.e., D(X_i) = ∫_{x ∈ X_i} D(x)). Let D(x | X_i) be the probability of a data point x sampled from D given that x ∈ X_i.
Our goal is to learn a stratified surrogate f̂ of a stratified function f, where each component function is learnable. We are given a data distribution D, a maximum sample budget n, a learning algorithm tr, a loss function ℓ, and a failure probability δ. As described in Section 3, we formalize this with the optimization problem in Equation (8).

Theorem 2. Equation (8) is minimized at the allocation stated in Section 3.

Proof. The task is to find, for each surrogate f̂_i, the number of samples n_i that minimizes the error (using the upper bound in Equation (3)) while also meeting the total sample budget: Σ_i n_i ≤ n. We verify the KKT conditions (stationarity, primal and dual feasibility, and complementary slackness) at the stated allocation. Because the objective is convex, this stationary point is the globally optimal solution. □

Note A.2. For a given δ = 1 − Π_i (1 − δ_i), all choices of δ_i result in most strata being dominated by the log δ_i⁻¹ term in the limit of infinitely many paths.

Proof. The constraint is equivalent to Π_i (1 − δ_i) = 1 − δ. To minimize the number of strata whose log δ_i⁻¹ term exceeds a given threshold subject to 1 − Π_i (1 − δ_i) ≤ δ, we must set as many δ_i as large as possible without violating the threshold; because of the constraint, this is possible for only a bounded number of strata, so in the limit of infinitely many paths the remaining strata are dominated by their log δ_i⁻¹ terms. □

Theorem 1. For a sufficiently wide (see Arora et al. [2019, Theorem 5.1]) 2-layer neural network trained with gradient descent for sufficiently many steps (ibid.), if f is analytic, x⃗ is on the d-dimensional unit sphere, and ℓ is 1-Lipschitz, then f(x⃗) is learnable in the sense of Equations (2) and (3).

Proof. The proof is similar to that of Theorem 8 in Agarwala et al. [2021], with two deviations. First, we note that Equation 15 in Agarwala et al. [2021] has a typo, which we correct: with the correction, the |a_0| term is not necessary, meaning that the f̃(0) term in Equation 14 in Agarwala et al. [2021] is also not necessary.
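Ignoring the log δ⁻¹ terms, the shape of a simplified version of this allocation problem can be worked out directly: if stratum j's error bound scales as √(c_j / n_j) and is weighted by its probability p_j, then minimizing Σ_j p_j √(c_j / n_j) subject to Σ_j n_j = n gives n_j ∝ (p_j √c_j)^(2/3) by Lagrange stationarity. The sketch below implements this simplification (our simplification, with our variable names, not the full statement of Theorem 2):

```python
def allocate(n, ps, cs):
    """Allocate a budget of n samples across strata with probabilities ps
    and learning complexities cs, minimizing sum_j p_j * sqrt(c_j / n_j)
    subject to sum_j n_j = n. Stationarity of the Lagrangian gives
    n_j proportional to (p_j * sqrt(c_j)) ** (2/3)."""
    weights = [(p * c ** 0.5) ** (2 / 3) for p, c in zip(ps, cs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# A high-probability, high-complexity stratum receives more samples:
# weights are (0.8 * 2)^(2/3) and (0.2 * 1)^(2/3), a 4:1 ratio.
ns = allocate(1000, ps=[0.8, 0.2], cs=[4.0, 1.0])
print([round(x, 1) for x in ns])  # [800.0, 200.0]
```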
Second, we note that the proof of Corollary 3 in Agarwala et al. [2021] involves appending a 1 to the neural network input; thus, the complexity of learning any function f(x⃗) is the same as the complexity of learning the function f(x⃗ ∥ 1).
Other than these two modifications, the proof is identical to that of Theorem 8 in Agarwala et al. [2021]. □

We now prove each component of Lemma 4.1.

Lemma 4.1. The tilde and its derivative have upper bounds that are compositional with respect to function composition. Aspects of these bounds are shown without proof in Agarwala et al. [2021]. We flesh out the proofs (using approaches from Agarwala et al. [2021]) and further prove bounds on the tilde derivatives.
Proof. The stated bounds follow from the definitions; note that other choices are possible and may result in a smaller tilde (hence the imprecision noted in Section 4.4). □

Proof. The proof is similar to that of Corollary 2 in Agarwala et al. [2021], following fairly directly from Fact 1 in Agarwala et al. [2021]. The tilde turns the function and its derivative into monotonic functions for x ≥ 0: the derivative of a power series with all nonnegative coefficients is also a power series with all nonnegative coefficients, and for x > 0 any power series with all nonnegative coefficients is nonnegative. Thus the derivatives of f̃ and f̃′ are both nonnegative for x ≥ 0, so both are monotonic. □
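The monotonicity argument can be checked concretely: the tilde of a power series Σ_k a_k x^k replaces each coefficient with its absolute value, so on nonnegative inputs the result is monotone even when the original function is not. A small sketch under that definition (the series truncation length is our choice):

```python
import math

def tilde_coeffs(coeffs):
    """Tilde of a power series sum_k a_k x^k: take |a_k| coefficient-wise."""
    return [abs(a) for a in coeffs]

def evaluate(coeffs, x):
    return sum(a * x ** k for k, a in enumerate(coeffs))

# Truncated series for sin(x): x - x^3/3! + x^5/5! - ...
sin_coeffs = [0.0, 1.0, 0.0, -1 / math.factorial(3), 0.0, 1 / math.factorial(5)]
tilde_sin = tilde_coeffs(sin_coeffs)  # matches sinh's (all-nonnegative) series

# The tilde is monotone nondecreasing for x >= 0, even though sin is not.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
vals = [evaluate(tilde_sin, x) for x in xs]
print(all(b >= a for a, b in zip(vals, vals[1:])))  # True
```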

C TURACO STATEMENT BIG-STEP SEMANTICS
This appendix presents the remaining semantics for Turaco not presented in Section 4. Figure 16 presents the big-step evaluation relation for statements in Turaco. The statement relation ⟨σ, s⟩ ⇓ σ′ says that under variable store σ, the statement s evaluates to a new variable store σ′. Figure 17 presents the big-step evaluation relation for Turaco programs. The program relation ⟨σ, fun (x_0, x_1, ..., x_n) { s; return x }⟩ ⇓ v says that under variable store σ (representing the inputs to the program), the program evaluates to value v.
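The store-passing reading of these relations can be sketched as a tiny interpreter. The rendering below is ours, for a toy statement language with only assignment and sequencing (Turaco's actual syntax and semantics are richer):

```python
# Statements as tuples: ("assign", var, expr) or ("seq", s1, s2).
# An expression is modeled as a function from store to value.

def eval_stmt(store, stmt):
    """Big-step: <store, stmt> evaluates to a new store."""
    kind = stmt[0]
    if kind == "assign":
        _, var, expr = stmt
        new_store = dict(store)
        new_store[var] = expr(store)
        return new_store
    if kind == "seq":
        _, s1, s2 = stmt
        return eval_stmt(eval_stmt(store, s1), s2)
    raise ValueError(f"unknown statement: {kind}")

def run_program(inputs, body, result_var):
    """<store, fun(...) { body; return x }> evaluates to the value of x."""
    return eval_stmt(dict(inputs), body)[result_var]

body = ("seq",
        ("assign", "y", lambda s: s["x"] * s["x"]),
        ("assign", "y", lambda s: s["y"] + 1.0))
print(run_program({"x": 3.0}, body, "y"))  # 10.0
```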

D TURACO ANALYSIS PROOF OF SOUNDNESS
This appendix proves that the Turaco complexity analysis is sound: it computes an upper bound on the true complexity of learning the program. We first note that the standard-execution big-step relation ⇓, both for expressions and for traces, is a function. We write ⟦e⟧ for that function on expressions and ⟦t⟧_x for that function on traces followed by taking the value of the variable x. We use the notation {f_i} as shorthand for {f_i | i ∈ I}, a set of functions indexed by i ∈ I. We write c̃ ⊢ {f_i} to denote the predicate that the f_i have tildes and tilde derivatives bounded by c̃. We use ∘ to denote function composition.

E LOG RULE
The log rule requires a parameter (call it c) specifying the point to expand around, since log is not analytic around 0. The value chosen for this parameter must satisfy the |x − c| < c condition for all inputs x in the standard interpretation (to ensure that all values are within the radius of convergence) and the x̃√(c² + 1) < c condition in the tilde interpretation (where x̃ is the tilde of the argument), but is otherwise free to be set to a value that minimizes the upper bound on the complexity.
As an example of a program that uses a value of this parameter other than 1, consider the following program:

fun (x) {
  x = log{3.88}(0.75 * x);
  x = x * x;
  return x;
}

This program cannot use c = 1, since it fails the condition in the tilde interpretation: x̃ = 0.75, so x̃√(c² + 1) = 0.75√2 > 1. Instead, this program requires c > 3√7/7 ≈ 1.13, and the complexity bound is minimized around c = 3.88. It would be interesting to automatically infer a value for this parameter that satisfies the constraints and minimizes the complexity bound, but doing so requires separate techniques that are outside the scope of this paper.
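Assuming the tilde-interpretation side condition has the form x̃√(c² + 1) < c (an assumption on our part, reconstructed from the worked numbers in this appendix), its feasibility can be checked numerically; all names below are ours:

```python
import math

def log_c_feasible(c, x_tilde):
    """Assumed tilde-interpretation side condition for log{c}(e):
    c must exceed x_tilde * sqrt(c^2 + 1)."""
    return c > x_tilde * math.sqrt(c ** 2 + 1)

def min_feasible_c(x_tilde):
    """Solving c > x_tilde * sqrt(c^2 + 1) for c (requires x_tilde < 1)
    gives the threshold c = x_tilde / sqrt(1 - x_tilde^2)."""
    return x_tilde / math.sqrt(1 - x_tilde ** 2)

x_tilde = 0.75  # tilde of the argument 0.75 * x on the unit interval
print(log_c_feasible(1.0, x_tilde))       # False: c = 1 is rejected
print(log_c_feasible(3.88, x_tilde))      # True: the example's choice works
print(round(min_feasible_c(x_tilde), 4))  # 1.1339 (= 3 * sqrt(7) / 7)
```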

G EXTENDED RENDERER DEMONSTRATION
This appendix provides further details on the evaluation on the renderer program in Section 5.2. Figure 21 presents the full code for the renderer program under study. Figure 22 presents the full set of scenes used in our evaluation. This program is a good candidate for a surrogate for several reasons. First, it is an approximable program: as long as the outputs of a surrogate of the program are sufficiently close to the ground-truth outputs, the generated image will be perceptually indistinguishable. Second, its paths are all determined by uniform input variables, variables in GLSL that are constant across each invocation of the shader. This means that, relative to the cost of executing the program, it is cheap to determine which path a given input induces in the program (and thus which surrogate to apply). Third, its execution environment is well suited to being replaced with a neural network, since the original program itself performs batch processing on a GPU. Figures 23 and 24 present larger versions of the renders in Section 5.2.4 for ease of comparison. The training budgets for these surrogates are 61 samples for the front-day surrogate and 2335 samples for the top-night surrogate.

Received 2023-04-14; accepted 2023-08-27