Deep Bayesian Active Learning for Accelerating Stochastic Simulation

Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. While deep surrogate models can speed up the simulations, doing so for stochastic simulations with active learning approaches is an underexplored area. We propose Interactive Neural Process (INP), a deep Bayesian active learning framework for learning deep surrogate models to accelerate stochastic simulations. INP consists of two components: a spatiotemporal surrogate model built upon the Neural Process (NP) family and an acquisition function for active learning. For surrogate modeling, we develop Spatiotemporal Neural Process (STNP) to mimic the simulator dynamics. For active learning, we propose a novel acquisition function, Latent Information Gain (LIG), calculated in the latent space of NP-based models. We perform a theoretical analysis and demonstrate that LIG reduces sample complexity compared with random sampling in high dimensions. We also conduct empirical studies on three complex spatiotemporal simulators for reaction diffusion, heat flow, and infectious disease. The results demonstrate that STNP outperforms the baselines in the offline learning setting and LIG achieves state-of-the-art performance for Bayesian active learning.


INTRODUCTION
Computational modeling is more than ever at the forefront of infectious disease research due to the COVID-19 pandemic. Stochastic simulations play a critical role in understanding and forecasting infectious disease dynamics, creating what-if scenarios, and informing public health policy making [11]. More broadly, stochastic simulations [2, 53] produce forecasts about complex interactions among people, environment, space, and time given a set of parameters. They provide the numerical tools to simulate stochastic processes in finance [33], chemistry [21], and many other scientific disciplines.
Unfortunately, stochastic simulations at fine-grained spatial and temporal resolution can be extremely computationally expensive. For example, epidemic models for realistic diffusion dynamics simulation via in-silico experiments require a large parameter space (e.g., characteristics of a virus, policy interventions, people's behavior). Similarly, reaction-diffusion systems, which play an important role in chemical reactions and bio-molecular processes, also involve a large number of simulation conditions. Therefore, hundreds of thousands of simulations are required to explore and calibrate the simulation model with observed experimental data. This process significantly hinders the adaptive capability of existing stochastic simulators, especially in "war time" emergencies, due to the lead time needed to execute new simulations and produce actionable insights that could help guide decision makers.
Learning deep surrogate models to speed up complex simulation has been explored in climate modeling and fluid dynamics for deterministic dynamics [5, 24, 52, 54, 62], but not for stochastic simulations. These surrogate models can only approximate specific system dynamics and fail to generalize under different parametrizations. Especially for pandemic scenario planning, we desire models that can predict futuristic scenarios under different conditions. Furthermore, the majority of surrogate models are trained passively using a simulation data set. This requires a large number of simulations beforehand to cover different parameter regimes of the simulator and ensure generalization.
We propose Interactive Neural Process (INP), a deep Bayesian active learning framework to speed up stochastic simulations. Given parameters such as the disease reproduction number and the incubation and infectious periods, mechanistic simulators generate future outbreak states with time-consuming numerical integration. INP accelerates the simulation by guiding a surrogate model to learn the input-output map between parameters and future states, hence bypassing numerical integration.
The deep surrogate model of INP is built upon Neural Process (NP) [20], which lies between Gaussian processes (GPs) and neural networks (NNs). NPs can approximate stochastic processes and therefore are well-suited for surrogate modeling of stochastic simulators. They learn distributions over functions and can generate prediction uncertainty for Bayesian active learning. Compared with GPs, NPs are more flexible and scalable for high-dimensional data with spatiotemporal dependencies. We design a novel Spatiotemporal Neural Process model (STNP) by introducing a time-evolving latent process for temporal dynamics and integrating spatial convolution for spatial modeling.
Instead of learning passively, we design active learning algorithms to interact with the simulator and update our model in "real time". We derive a new acquisition function, Latent Information Gain (LIG), based on our unique model design. Our algorithm selects the parameters with the highest LIG, queries the simulator to generate new simulation data, and continuously updates our model. We provide theoretical guarantees for the sample efficiency of this procedure over random sampling. We also demonstrate the efficacy of our method on large-scale spatiotemporal epidemic and reaction diffusion models. In summary, our contributions include:
• Interactive Neural Process: a deep Bayesian active learning framework for accelerating large-scale stochastic simulation.
• A new surrogate model, Spatiotemporal Neural Process (STNP), for high-dimensional spatiotemporal data that integrates a temporal latent process and spatial convolution.
• A new acquisition function, Latent Information Gain (LIG), based on the inferred temporal latent process to quantify uncertainty, with theoretical guarantees.
• Real-world applications to speed up complex stochastic spatiotemporal simulations including a reaction-diffusion system, heat flow, and age-structured epidemic dynamics.

RELATED WORK
Bayesian Active Learning and Experimental Design. Bayesian active learning, or experimental design, is well-studied in statistics and machine learning [6, 10]. Gaussian Processes (GPs) are popular for posterior estimation, e.g., [25] and [69], but often struggle in high dimensions. Deep neural networks provide scalable solutions for active learning. Deep active learning has been applied to discrete problems such as image classification [19] and sequence labeling [57], whereas our task is continuous time series. Our problem can also be viewed as sequential experimental design, where we design simulation parameters to obtain the desired outcome (imitating the simulator). Foster et al. [15] propose deep design networks for Bayesian experimental design, but they require an explicit likelihood model and conditional independence in experiments. Kleinegesse and Gutmann [32] consider implicit models where the likelihood function is intractable, but computing the Jacobian through the sampling path can be expensive, and their experiments are mostly limited to low-dimensional (≤10) designs. In contrast, our design space is of much higher dimension and we do not have access to an explicit likelihood model for the simulator.
Neural Processes. Neural Processes (NP) [20] model distributions over functions and imbue neural networks with the ability of GPs to estimate uncertainty. NP has many extensions such as attentive NP [29] and functional NP [38]. However, NP implicitly assumes permutation invariance in the latent variables, which can be limiting for modeling temporal dynamics. Singh et al. [58] propose sequential NP (SNP) by incorporating a temporal transition model into NP. Still, sequential NP assumes the latent variables are independent conditioned on the hidden states. We propose STNP with a temporal latent process and spatial convolution, which is well-suited for modeling the spatiotemporal dynamics of infectious disease. We apply our model to real-world large-scale Bayesian active learning. Note that even though Garnelo et al. [20] have demonstrated NP for Bayesian optimization, it is only for toy 1-D functions.
Stochastic Simulation and Dynamics Modeling. Stochastic simulations are fundamental to many scientific fields [53] such as epidemic modeling. Data-driven models of infectious diseases are increasingly used to forecast the evolution of an ongoing outbreak [1, 11, 39]. However, very few models can mimic the internal mechanism of a stochastic simulator and answer "what-if" questions. GPs are commonly used as surrogate models for expensive simulators [22, 27, 41, 50], but GPs do not scale well to high-dimensional data. Likelihood-free inference methods [40, 44, 49, 64] learn the posterior of the parameters given the observed data. They perform neural density estimation, but require a large number of simulations. For active learning, instead of relying on Monte Carlo sampling, we directly compute the information gain in the latent process. Qian et al. [50] use GPs as a prior for an SEIR model for learning lockdown policy effects, but GPs are computationally expensive and the simple SEIR model cannot capture the real-world large-scale, spatiotemporal dynamics considered in this work. We demonstrate the use of a deep sequence model as a prior distribution in Bayesian active learning. Our framework is also compatible with other deep sequence models for time series, e.g., Deep State Space [51] and Neural ODE [7].

METHODOLOGY
Consider a stochastic process {x_1, …, x_T}, governed by time-varying parameters θ_t ∈ R^d and the initial state x_0 ∈ R^D. In epidemic modeling, θ_t can represent the effective reproduction number of the virus at a given time, the effective contact rates between individuals belonging to different age groups, the people's degree of short- or long-range mobility, or the effects of time-varying policy interventions (e.g., non-pharmaceutical interventions). The state x_t ∈ R^D includes both the daily prevalence and daily incidence for each compartment of the epidemic model (e.g., the number of people that are infectious and the number of newly infected individuals at time t).
Stochastic simulation uses a mechanistic model f(θ; ω) to simulate the process, where the random variable ω represents the randomness in the simulator. Let x := (x_0, x_1, …, x_T) denote the simulated trajectory.

Interactive Neural Process
INP is used to train a deep surrogate model to mimic the stochastic simulator. As shown in Figure 1, given parameters θ, we query the simulator, i.e., the mechanistic model, to obtain a set of simulations. We train an NP-based model to learn the probabilistic map from parameters to future states. Our NP model can be spatiotemporal to capture complex dynamics such as the disease dynamics of the epidemic simulator. During inference, the model needs to generate predictions (x̂_1, …, x̂_T) at the target parameters θ corresponding to different scenarios.
Instead of simulating at a wide range of parameter regimes, we take a Bayesian active learning approach to proactively query the simulator and update the model incrementally. Using NP, we can infer the latent temporal process (z_1, …, z_T) that encodes the uncertainty of the current surrogate model. Then we propose a new acquisition function, Latent Information Gain (LIG), to select the θ★ with the highest reward. We use θ★ to query the simulator, which in turn generates new simulations to further improve the model. Next, we describe each of the components in detail.

Spatiotemporal Neural Process
Neural Process (NP) [20] models the predictive distribution by marginalizing over a global latent variable z: p(x_{1:T} | θ) = ∫ p(x_{1:T} | θ, z) p(z) dz. Here p(z) is the prior distribution for the latent variable. We use x_{1:T} as a shorthand for (x_1, …, x_T). The prior distribution p(z) is conditioned on a set of context points θ_C, x_{C,1:T} as p(z | x_{C,1:T}, θ_C). However, the global latent variable z in NP can be limiting for the non-stationary, spatiotemporal dynamics in the epidemics. We propose Spatiotemporal Neural Process (STNP) with two extensions. First, we introduce a temporal latent process (z_1, …, z_T) to represent the unknown dynamics. The latent process provides an expressive description of the internal mechanism of the stochastic simulator. Each latent variable z_t is sampled conditioned on the past history. Second, we explicitly model the spatial dependency in x_t ∈ R^D. Rather than treating the dimensions in x_t as independent features, we capture their correlations with regular grids or graphs. For instance, the travel graph between locations can be represented as an adjacency matrix A ∈ R^{D×D}.
Given parameters {θ}, simulation data {x_{1:T}}, and the spatial graph A as inputs, STNP models the conditional distribution p(x_{1:T} | z_{1:T}, θ, A). Figure 2 visualizes the graphical models of our STNP, the original NP [20] model, and Sequential NP [58]. The main difference between STNP and the baselines is the encoding procedure to infer the temporal latent process. Compared with STNP, which directly embeds the history for z_t inference at the current timestamp, NP ignores the history and SNP only embeds the partial history information from the previous timestamp. We implement STNP following an encoder-decoder architecture. The encoder parametrizes the mean and standard deviation of the variational posterior q(z_{1:T} | x_{1:T}, θ, A) and the decoder approximates the predictive distribution p(x_{1:T} | z_{1:T}, θ, A). To incorporate the spatial graph information, we use a Diffusion Convolutional Gated Recurrent Unit (DCGRU) layer [35], which integrates graph convolution in a GRU cell. We use multi-layer GRUs to obtain hidden states from the inputs. Using re-parametrization [30], we sample z_t from the encoder and then decode x̂_t conditioned on z_t in an auto-regressive fashion. To ensure fair comparisons, we adapt NP and SNP to graph-based settings and use the same architecture as STNP to generate the hidden states. Note that if the spatial dependency is regular grid-based, then the DCGRU layer is replaced with a Convolutional LSTM layer [37, 56, 63, 65, 66] and there is no adjacency matrix A in Equation 2.
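The time-evolving latent process at the core of STNP can be sketched in a few lines. The linear transition below is a hypothetical stand-in for the learned GRU encoder; it only illustrates how each z_t is sampled conditioned on its history via the reparametrization trick, in contrast to vanilla NP's single global latent z:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_temporal_latents(T, dz, transition, rng):
    # Sketch of STNP's temporal latent process: each z_t is sampled
    # conditioned on the history through a transition function giving
    # the prior p(z_t | z_{<t}) as a diagonal Gaussian.
    zs = []
    z = np.zeros(dz)
    for _ in range(T):
        mu, log_sigma = transition(z)         # prior parameters from history
        eps = rng.standard_normal(dz)
        z = mu + np.exp(log_sigma) * eps      # reparameterized sample of z_t
        zs.append(z)
    return np.stack(zs)                       # shape (T, dz)

# Toy linear transition standing in for the learned encoder network.
W = rng.standard_normal((4, 4)) * 0.1
latents = sample_temporal_latents(
    T=5, dz=4,
    transition=lambda z: (W @ z, np.full(4, -1.0)),
    rng=rng)
```

In the actual model the transition and emission are neural networks trained with the variational objective; the sketch only captures the autoregressive sampling structure.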

Bayesian Active Learning
Algorithm 1 details a Bayesian active learning algorithm based on Bayesian optimization [17, 55]. We train an NP model to interact with the simulator and improve learning. Let the superscript (i) denote the i-th interaction. We start with an initial data set S^(1) = {θ^(1), x^(1)_{1:T}}. We evaluate the current model's predictions with an acquisition function r(x̂_{1:T} | z_{1:T}, θ) and select the set of parameters {θ^(i+1)} with the highest reward. We query the simulator with {θ^(i+1)} to augment the training data set S^(i+1) and update the NP model for the next iteration.
The choice of the reward (acquisition) function r depends on the goal of the active learning task. For example, to find the model that best fits the data, the reward function can be the log-likelihood r = log p(x̂_{1:T} | θ, z). To collect data and reduce model uncertainty in Bayesian experimental design, the reward function can be the mutual information. In what follows, we discuss different strategies to design the reward/acquisition function. We also propose a novel acquisition function based on the information gain in the latent space, tailored to our STNP model.
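As a sketch of this interaction loop (Algorithm 1), the following hypothetical driver shows how the pieces fit together; `simulate`, `fit`, and `acquire` are stand-ins for the simulator, the NP training step, and the acquisition function respectively:

```python
def bayesian_active_learning(simulate, fit, acquire, candidates,
                             init_idx, n_iters, per_iter):
    # Generic sketch of the active-learning loop: train the surrogate on
    # the current data, score the remaining candidate parameters with the
    # acquisition function, query the simulator at the best ones, repeat.
    data = {i: simulate(candidates[i]) for i in init_idx}
    for _ in range(n_iters):
        model = fit(data)                                   # update surrogate
        pool = [i for i in range(len(candidates)) if i not in data]
        scores = {i: acquire(model, candidates[i]) for i in pool}
        best = sorted(pool, key=scores.get, reverse=True)[:per_iter]
        for i in best:
            data[i] = simulate(candidates[i])               # query simulator
    return fit(data), data
```

In INP, `fit` would continue training the NP/STNP model and `acquire` would compute LIG; the loop structure itself is the same.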

Reward/Acquisition functions
For regression tasks, standard acquisition functions for active learning include Maximum Mean Standard Deviation (Mean STD), Maximum Entropy, Bayesian Active Learning by Disagreement (BALD) or expected information gain (EIG), and random sampling [19]. We explore various acquisition functions and their approximations in the context of NP. We also introduce a new acquisition function based on our unique NP design, called Latent Information Gain (LIG). The details of Mean STD and Maximum Entropy are given in Appendix B.4.
BALD/Expected Information Gain (EIG). BALD [25] quantifies the mutual information between the prediction and the model posterior, H(x̂_{1:T} | θ) − E[H(x̂_{1:T} | z_{1:T}, θ)], which is equivalent to the expected information gain (EIG). Computing the EIG for surrogate modeling is challenging since p(x̂_{1:T} | z_{1:T}, θ) cannot be found in closed form in general. The integrand is intractable and conventional MC methods are not applicable [16]. One way to get around this is to employ a nested MC estimator with quadratic computational cost for sampling [45, 61], which is computationally infeasible. To reduce the computational cost, we assume p(x̂_{1:T} | z_{1:T}, θ) follows a multivariate Gaussian distribution. Each feature of x̂_{1:T} can be parameterized with a mean and standard deviation predicted from the surrogate model, assuming the output features are independent of each other. This distributional assumption can be limiting in the high-dimensional spatiotemporal domain, which makes EIG less informative.
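Under this Gaussian assumption, the EIG baseline can be approximated by moment matching across latent samples. The sketch below (our own illustration, not necessarily the paper's exact implementation) computes H(x̂ | θ) − E_z[H(x̂ | z, θ)] with a moment-matched Gaussian marginal:

```python
import numpy as np

def eig_gaussian(pred_means, pred_stds):
    # Gaussian-approximated EIG (BALD). Inputs have shape
    # (n_latent_samples, n_features): one row of predictive means/stds per
    # latent sample z ~ q(z), features assumed independent.
    const = 0.5 * np.log(2 * np.pi * np.e)
    # Moment-matched marginal variance: Var[mu] + E[sigma^2].
    marg_var = pred_means.var(axis=0) + (pred_stds ** 2).mean(axis=0)
    h_marginal = np.sum(0.5 * np.log(marg_var) + const)
    # Conditional entropy averaged over latent samples.
    h_conditional = np.mean(np.sum(np.log(pred_stds) + const, axis=1))
    return h_marginal - h_conditional
```

When all latent samples give the same predictive distribution the score is zero; disagreement between latent samples inflates the marginal variance and hence the score.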
Latent Information Gain (LIG). To overcome the limitations mentioned above, we propose a novel acquisition function computing the expected information gain in the latent space rather than the observational space. To design this acquisition function, we prove the equivalence between the expected information gain in the observational space and the expected KL divergence in the latent process w.r.t. a candidate parameter θ, as stated in the following proposition.
Proposition 1. The expected information gain (EIG) for Neural Process is equivalent to the expected KL divergence between the prior and posterior in the latent process, that is, EIG(θ) = E_{p(x̂_{1:T} | θ)} [KL(p(z_{1:T} | x̂_{1:T}, θ) ∥ p(z_{1:T}))], where KL(·∥·) denotes the KL divergence between two distributions. See the proof in Appendix A.
In this way, conventional MC methods become applicable, which reduces the quadratic computational cost to linear. At the same time, although z_{1:T} are assumed to be multivariate Gaussian and parameterized with means and standard deviations, this assumption is only made in the latent space, not the observational space. Moreover, LIG is also more computationally efficient and accurate for batch active learning. Due to the context aggregation mechanism of NP, we can directly calculate LIG with respect to a batch of θ in the candidate set. This is not available for the baseline acquisition functions: they all require calculating the scores one by one for all θ in the candidate set and selecting a batch of θ based on their scores. Such an approach is both slow and inaccurate, since points that are informative individually are not necessarily informative jointly [31].
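A minimal sketch of LIG for diagonal-Gaussian latents follows. Here `encode` and `decode_sample` are hypothetical stand-ins for the trained NP encoder and decoder; the Monte Carlo average runs over imagined outcomes at the candidate θ:

```python
import numpy as np

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    # KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ), summed over dimensions.
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2) - 0.5)

def latent_information_gain(encode, context, theta, decode_sample,
                            n_mc=8, rng=None):
    # LIG sketch: average, over imagined outcomes x ~ p(x | theta) from the
    # current surrogate, the KL between the latent posterior given
    # (context + new point) and the prior given the context alone.
    if rng is None:
        rng = np.random.default_rng(0)
    mu_p, sig_p = encode(context)                      # latent prior
    total = 0.0
    for _ in range(n_mc):
        x = decode_sample(theta, rng)                  # imagined simulation
        mu_q, sig_q = encode(context + [(theta, x)])   # updated latent posterior
        total += gaussian_kl(mu_q, sig_q, mu_p, sig_p)
    return total / n_mc
```

Because the KL is evaluated in the (low-dimensional) latent space, each MC sample is cheap, avoiding the nested estimator needed for EIG in observation space.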

Theoretical Analysis
We shed light on the intuition behind choosing adaptive sample selection over random sampling by analyzing a simplified situation. Assume that at a certain stage we have learned a feature map Ψ which maps the input θ of the neural network to the last layer. Then the output y can be modeled as y = ⟨Ψ(θ), z*⟩ + ε, where z* is the true hidden variable and ε is random noise.
Our goal is to generate an estimate ẑ and use it to make predictions ⟨Ψ(θ), ẑ⟩. A good estimate shall achieve a small error in terms of ∥ẑ − z*∥₂ with high probability. In the following theorem, we prove that greedily maximizing the variance of the prediction to choose θ will lead to an error a factor of order O(d) smaller than that of random exploration in the space of θ, which is significant in high dimensions.
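This simplified setting can be simulated directly. The sketch below runs sequential Bayesian linear regression with a random linear feature map Ψ, comparing greedy posterior-variance maximization against random selection; all dimensions, counts, and noise levels are illustrative choices, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rounds, noise = 20, 60, 0.1
z_star = rng.standard_normal(d)                  # true hidden variable z*
Psi = rng.standard_normal((d, d)) / np.sqrt(d)   # linear random feature map

def run(select_greedy):
    # Sequential estimation of z* from y = <Psi(theta), z*> + noise.
    # Greedy picks the candidate with the largest predictive variance
    # x^T A^{-1} x; random just takes an arbitrary candidate.
    A = np.eye(d)            # posterior precision (unit Gaussian prior)
    b = np.zeros(d)
    for _ in range(n_rounds):
        cands = rng.standard_normal((50, d))
        feats = cands @ Psi.T
        if select_greedy:
            Ainv = np.linalg.inv(A)
            var = np.einsum('ij,jk,ik->i', feats, Ainv, feats)
            x = feats[np.argmax(var)]
        else:
            x = feats[0]
        y = x @ z_star + noise * rng.standard_normal()
        A += np.outer(x, x) / noise ** 2
        b += y * x / noise ** 2
    return np.linalg.norm(np.linalg.solve(A, b) - z_star)

err_greedy, err_random = run(True), run(False)
```

Running this for larger d tends to widen the gap between the two strategies, matching the intuition of the theorem.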
See the proof in Appendix A.2.

EXPERIMENTS
We evaluate INP on four stochastic simulators: an SEIR compartmental model, a reaction-diffusion system, a heat equation model, and a large-scale spatiotemporal epidemic model (LEAM-US).

Experimental Setup
We experiment with the following four stochastic simulators.
SEIR Compartmental Model. To highlight the difference between NP and GP, we begin with a simple stochastic, discrete, chain-binomial SEIR compartmental model as our stochastic simulator. In this model, susceptible individuals (S) become exposed (E) through interactions with infectious individuals (I) and are eventually removed (R); details are deferred to Appendix B.1.
We set the total population N = S + E + I + R to 100,000, the initial number of exposed individuals to E_0 = 2,000, and the initial number of infectious individuals to I_0 = 2,000. We assume latent individuals move to the infectious stage at a rate ε ∈ [0.25, 0.65] (step 0.05), the infectious period μ⁻¹ is set to be equal to 1 day, and we let the basic reproduction number R_0 (which in this case coincides with the transmissibility rate β) vary between 1.1 and 4.0 (step 0.1). Here, each (β, ε) pair corresponds to a specific scenario, which determines the parameters θ. We simulate the first 100 days of the epidemic with a total of 300 scenarios and generate 30 samples for each scenario.
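For concreteness, a minimal chain-binomial SEIR simulator consistent with this setup might look as follows; it is a sketch under the stated population and rate assumptions, not the exact simulator used in the paper:

```python
import numpy as np

def seir_chain_binomial(beta, epsilon, mu, N=100_000, E0=2_000, I0=2_000,
                        T=100, rng=None):
    # Chain-binomial SEIR: at each day, transitions S->E, E->I, I->R are
    # drawn from binomial distributions with per-individual probabilities
    # derived from the rates (beta: transmission, epsilon: latency exit,
    # mu: recovery). Returns the daily infectious counts I_t.
    if rng is None:
        rng = np.random.default_rng(0)
    S, E, I, R = N - E0 - I0, E0, I0, 0
    traj = []
    for _ in range(T):
        p_inf = 1.0 - np.exp(-beta * I / N)            # force of infection
        new_E = rng.binomial(S, p_inf)                 # S -> E
        new_I = rng.binomial(E, 1.0 - np.exp(-epsilon))  # E -> I
        new_R = rng.binomial(I, 1.0 - np.exp(-mu))       # I -> R
        S -= new_E
        E += new_E - new_I
        I += new_I - new_R
        R += new_R
        traj.append(I)
    return np.array(traj)
```

Each call with the same (β, ε) produces a different stochastic trajectory, which is exactly the behavior the surrogate must capture distributionally.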
We predict the number of individuals in the infectious compartment. The input is the (β, ε) pair and the output is the 100-day infection prediction. As the simulator is not spatiotemporal, we use the vanilla NP model with the global latent variable z. For each epoch, we randomly select 10% of the samples as context. Implementation details are deferred to Appendix B.5.
Reaction Diffusion Model. The reaction-diffusion (RD) system [60] is a spatiotemporal model that simulates how two chemicals might react with each other as they diffuse through a medium together. The simulation is based on the initial pattern, the feed rate (θ_0), the kill rate (θ_1), and the reaction between the two substances. We use an RD simulator to generate sequences from 0 to 500 timestamps, sampled every 100 timestamps, resulting in 5 timestamps for each simulated sequence. Every timestamp is a 3D tensor (2 × 32 × 32), where dimension 0 corresponds to the two substances in the reaction and dimensions 1 and 2 are the image representation of the reaction-diffusion process. Each sequence is simulated with a unique feed rate θ_0 ∈ [0.029, 0.045] and kill rate θ_1 ∈ [0.055, 0.062] combination. There are 200 uniformly sampled scenarios, corresponding to (θ_0, θ_1) combinations.
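A single update step of such a Gray-Scott-style RD simulator can be sketched as below; the diffusion coefficients, the explicit Euler scheme, and the periodic boundaries are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def gray_scott_step(U, V, f, k, Du=0.16, Dv=0.08, dt=1.0):
    # One explicit Euler step of a Gray-Scott reaction-diffusion system
    # with feed rate f and kill rate k for two substances U, V.
    def lap(Z):
        # 5-point Laplacian with periodic boundary conditions.
        return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0)
                + np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)
    UVV = U * V * V                         # reaction term U + 2V -> 3V
    U = U + dt * (Du * lap(U) - UVV + f * (1 - U))
    V = V + dt * (Dv * lap(V) + UVV - (f + k) * V)
    return U, V
```

Iterating this step from a fixed initial pattern and stacking snapshots every 100 steps would reproduce the (2 × 32 × 32) sequences described above.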
We implement STNP to mimic the reaction-diffusion simulator with the feed rate (θ_0) and kill rate (θ_1) as input. The initial state of the reaction is fixed. We use multiple convolutional layers with a linear layer to encode the spatial data into the latent space. We use an LSTM layer to encode the latent spatial data with θ_0, θ_1 to map the input-output pairs to hidden features z_{1:5}. With (θ_0, θ_1) and z_{1:5} sampled from the posterior distribution, we use an LSTM layer and deconvolutional layers to simulate the reaction-diffusion sequence. For each epoch, we randomly select 20% of the samples as the context sequence.
Heat Model. This model predicts the spatial solution fields of the heat equation [47]. The ground-truth data is generated from the standard numerical solver used in [34], and the experimental setting also follows [34]. The examples are generated by solvers running on 32 × 32 meshes; the corresponding output dimension is 1024. The input consists of three parameters that control the thermal conductivity and the flux rate.
Local Epidemic and Mobility model. The Local Epidemic and Mobility model (LEAM-US) is a stochastic, spatial, age-structured epidemic model based on a metapopulation approach which divides the US into more than 3,100 subpopulations, each one corresponding to a US county or statistically equivalent entity. Population size and county-specific age distributions reflect the Census' annual resident population estimates for the year 2019. We consider individuals divided into 10 age groups. Contact mixing patterns are age-dependent and state-specific, modeled using contact matrices that describe the interaction of individuals in different social settings [43]. LEAM-US integrates a human mobility layer, represented as a network, using both short-range (i.e., commuting) and long-range (i.e., flights) mobility data; see more details in Appendix B.2.
We separate the data in California monthly to predict the 28-day sequence from the 2nd to the 29th day of each month from March to December. Each θ includes the county-level parameters of LEAM-US and the state-level incidence and prevalence compartments. The total number of dimensions in x is 16,912; see details in Appendix B.2. Overall, there are 315 scenarios in the search space, corresponding to 315 different θ with a total of 16,254 samples. We split 78% of the data as the candidate set, and 11% each for validation and test. For active learning, we use the candidate set as the search space.
We instantiate an STNP model to mimic an epidemic simulator that has x at both the county and state level and θ_t at the state level. We use the county-level parameters θ together with a county-to-county mobility graph A in California as input. We use the DCGRU layer [35] to encode the mobility graph in a GRU. We use a linear layer to map the county-level output to hidden features at the state level. For both the state-level encoder and decoder, we use multi-layer GRUs. For each epoch, we randomly select 20% of the samples as the context sequence.

Offline Learning Performance
We compare our proposed STNP with vanilla NP [20], SNP [58], Masked Autoregressive Flow (MAF) [48], an RNN baseline with variational dropout (RNN) [68], and a VAE-based deep surrogate model for multi-fidelity active learning (DMFAL) [34]. The implementation details can be found in Appendix B.6. The key innovation of STNP is the introduced temporal latent process. To ensure fair comparison, we modified the baselines for the RD and Heat models by adding convolutional layers for data encoding and deconvolutional layers for sequence generation. For the LEAM-US model, we modified NP by adding convolutional layers with diffusion convolution [36] to embed the graphs. Similarly, we modified SNP by replacing its convolutional layers with diffusion convolution. The remaining baselines do not support LEAM-US surrogate modeling. Table 1 shows the testing MAE of the different NP models trained in an offline fashion. Our STNP significantly improves the performance and can accurately learn the simulator dynamics in all three experiments.
Figure 3 (left) compares the NP and GP performance on one scenario in the held-out test set. It shows the ground truth and the predicted number of the infectious population for the first 50 days. We also include the confidence intervals (CIs) with 5 standard deviations for the ground truth and NP predictions and 1 standard deviation for the GP predictions. We observe that NP fits the simulation dynamics better than GP for the mean prediction. Moreover, NP has CIs closer to the truth, reflecting the simulator's intrinsic uncertainty. GP shows larger CIs, which represent the model's own uncertainty. Note that NP is much more flexible than GP and can scale easily to high-dimensional data. Figure 3 (middle) indicates that STNP can accurately predict the various patterns corresponding to different (θ_0, θ_1). This confirms that our STNP is able to capture the high-dimensional spatiotemporal dependencies in RD simulations. Figure 3 (right) visualizes the STNP predictions in four key compartments of a typical scenario with R_0 = 3.1 from March 2nd to March 29th. The confidence interval is plotted with 2 standard deviations. We can see that both the mean and the confidence interval of the STNP predictions match the truth well. These results demonstrate the promise of the generative STNP model to serve as a deep surrogate model for the RD and LEAM-US simulators.

Active Learning
Implementation Details. We compare 6 different acquisition functions with NP for the SEIR model and STNP for the RD, Heat, and LEAM-US models. For SEIR, the initial training dataset has 2 scenarios, and we continue adding 1 scenario per iteration to the training set until the test loss converges to the offline modeling performance. We also include GP with 3 different acquisition functions and Sequential Neural Likelihood (SNL) (see Table 3). For the RD and Heat models, all acquisition functions start with the same 5 scenarios randomly picked from the training dataset. Then we continue adding 5 scenarios per iteration to the training set until the test loss converges. Similarly, the LEAM-US model begins with 27 training data points, and we continue adding 8 scenarios per iteration to the training set until the validation loss converges. We measure the average performance over three random runs and report the MAE on the test set.
Active Learning Performance. Figure 4 shows the testing MAE versus the percentage of samples included for training. The percentage of data is linearly proportional to the overall running time. The figure shows that our proposed LIG always has the best MAE performance until convergence for the SEIR, RD, and LEAM-US tasks. As shown in Figure 4 (left), we compare different acquisition functions on both NP and GP for the SEIR model. None of the GP methods converge after selecting 4.07% of the data for training, while the NP methods converge much faster. Our proposed acquisition function LIG is the most sample-efficient of the acquisition functions used with NP. It takes only 4.07% of the data to converge and reach the NP offline performance, which uses the entire training set. Moreover, there is an enormous gap between LIG and EIG with respect to the active learning performance. This validates our theory that the uncertainty of the deep surrogate model is better measured in the latent space instead of the predictions. Similarly, in Figure 4 (middle and right), we compare LIG with other acquisition functions on STNP for the RD and LEAM-US models. LIG converges to the offline performance using only 21.87% of the data for the RD experiment and 31.4% of the data for the LEAM-US experiment. Therefore, it is consistent among all three experiments that our proposed LIG always has the best MAE performance until convergence. Note that Figure 4 (right) shows the log-scale MAE versus the percentage of samples included for training. A detailed performance comparison, including mean and standard deviation for all four tasks including Heat, can be found in Appendix C.2. Our proposed LIG also has the best MAE performance for the Heat task.
Exploration-Exploitation Trade-off. To understand the large performance gap between LIG and the baselines, we visualize the test MAE and the acquisition function score for each Bayesian active learning iteration for the SEIR model in Figure 5. EIG, Mean STD, and Maximum Entropy all tend to exploit the region with a large transmission rate for the first 2 iterations. Including these scenarios makes the training set unbalanced: the MAE in the region with a small transmission rate becomes worse after 2 iterations. Meanwhile, Random performs pure exploration, and the improvement in MAE is not apparent after 2 iterations. Our proposed LIG is able to reach a balance by exploiting the uncertainty in the latent process while encouraging exploration. Hence, within a small number of iterations (i = 2), it has already selected "informative scenarios" in the search space.

CONCLUSION
We propose a unified framework, Interactive Neural Processes (INP), for deep Bayesian active learning that can seamlessly interact with existing stochastic simulators and accelerate simulation. Specifically, we design STNP to approximate the underlying simulation dynamics. It infers the latent process that describes the intrinsic uncertainty of the simulator. We exploit this uncertainty and propose LIG as a powerful acquisition function in deep Bayesian active learning. We perform a theoretical analysis and demonstrate that our approach reduces sample complexity compared with random sampling in high dimensions. We also conducted extensive empirical evaluations on several complex real-world spatiotemporal simulators to demonstrate the superior performance of our proposed STNP and LIG. For future work, we plan to leverage Bayesian optimization techniques to directly optimize for the target parameters with auto-differentiation.

A THEORETICAL ANALYSIS
A.1 Latent Information Gain
Proposition 1. The expected information gain (EIG) for Neural Process is equivalent to the expected KL divergence between the prior and posterior in the latent process.
Proof of Proposition 1. The information gained in the latent process z by selecting the parameter θ and generating x̂ is the reduction in entropy from the prior to the posterior: IG(x̂, θ) = H(z) − H(z | x̂, θ). Taking the expectation of IG(x̂, θ) under the marginal distribution p(x̂ | θ), we obtain from the conditional independence of x̂ and θ that EIG(θ) = E_{p(x̂ | θ)} [KL(p(z | x̂, θ) ∥ p(z))].

A.2 Sample Efficiency of Active Learning
From the main text we know that in each round, the output random variable is y = ⟨Ψ(θ), z*⟩ + ε (equation 5). We further assume that the random noise ε is mean zero and σ-subGaussian. Using this information, we treat z as an unknown parameter and define a likelihood function so that p(y | θ; z) has good coverage over the observations (equation 6). Let the prior distribution over z be standard normal. Here we use n instead of (i) in Algorithm 1 to represent the number of iterations. We can form a posterior over z in the n-th round. Focusing on the random variable z ∼ p(· | θ_1, y_1, …, θ_n, y_n), the estimate of the hidden variable, we can express it in the n-th round as a Gaussian centered at ẑ_n perturbed by a standard normal random variable ξ. We can either choose the action θ randomly or greedily. A random choice of θ corresponds to sampling θ from a normal distribution (equation 7). A greedy procedure is to choose the action θ_n in the n-th round to optimize KL(p(z | x̂, θ) ∥ p(z)), where we denote the estimated output variable x̂ given θ and z as x̂ = ⟨Ψ(θ), z⟩. This optimization procedure is equivalent to maximizing the variance of the prediction (equation 8). For both approaches, we assume that the features Ψ(θ) are normalized. We compare the statistical risk of this approach with the random sampling approach.
We now analyze random sampling of $\theta$ versus greedy search for $\theta$.
If the feature map is the identity, $\Psi(\cdot) = \mathrm{id}$, then from random matrix theory we know that, for $\theta$ randomly sampled from a normal distribution and for large $n$, $\lambda_{\min}(\Theta_n\Theta_n^\top/n)$ concentrates around $(1 - \sqrt{d/n})^2$, which is of order $\Omega(1)$. This leads to an appealing risk bound for $\|\hat{z} - z^*\|_2$ on the order of $O(\sigma\sqrt{d}/\sqrt{n})$. However, in high dimensions, this feature map is often far from the identity. In the proof of Theorem 1 below, we demonstrate that even when $\Psi(\cdot)$ is simply a linear random feature map with i.i.d. normal entries, random exploration in $\theta$ can lead to a deteriorated error bound. This setting is motivated by the analyses of wide neural networks, where the features learned from gradient descent are close to those generated from random initialization [13,42].
Theorem 1 (Formal statement). Assume that the noise $\epsilon$ in equation 5 is $\sigma$-subGaussian. For a normalized linear random feature map $\Psi(\cdot)$, greedily optimizing the KL divergence $\mathrm{KL}\big(p(z \mid \hat{x}, \theta) \,\|\, p(z)\big)$ (or, equivalently, the variance of the posterior predictive distribution defined in equation 8) in the search of $\theta$ will lead to an error
$$\|\hat{z} - z^*\|_2 = O\!\left(\frac{\sigma d}{\sqrt{n}}\right)$$
with high probability. On the other hand, random sampling of $\theta$ following equation 7 will lead to
$$\|\hat{z} - z^*\|_2 = O\!\left(\frac{\sigma d^{3/2}}{\sqrt{n}}\right)$$
with high probability.
Proof of Theorem 1. For a linear random feature map, we can express $\Psi(\theta) = \Psi\theta$, where the entries of $\Psi \in \mathbb{R}^{d \times d}$ are i.i.d. normal. The entries of $\Psi$ are then normalized.
• For random exploration of $\theta$, the matrix containing the feature vectors becomes $\Phi_n = \Psi\Theta_n$, where the matrix $\Theta_n \in \mathbb{R}^{d \times n}$ collects the $n$ column vectors $\{\theta_1, \ldots, \theta_n\}$. Then $\Phi_n\Phi_n^\top = \Psi\Theta_n\Theta_n^\top\Psi^\top$. From random matrix theory, we know that the condition number of $\Psi$ is of order $d$ with high probability [8]. Hence, for normalized $\Psi$ and $\theta$,
$$\lambda_{\min}\big(\Psi\Theta_n\Theta_n^\top\Psi^\top\big) \geq \sigma^2_{\min}(\Psi)\,\lambda_{\min}\big(\Theta_n\Theta_n^\top\big) = \frac{1}{d^2}\,\lambda_{\min}\big(\Theta_n\Theta_n^\top\big).$$
The inequality holds because the smallest singular value is the inverse of the norm of the inverse matrix. We then use the fact from random matrix theory that, for normalized random $\theta$, the asymptotic distribution of the eigenvalues of $\Theta_n\Theta_n^\top/n$ follows the Marchenko-Pastur law.

• We then analyze the error associated with greedy maximization of the posterior predictive variance. We first note that the variance of the posterior predictive distribution in equation 8 can be expressed as follows using equation 6:
$$\mathrm{Var}(\hat{x}) = \mathbb{E}\big[\langle \Psi(\theta), z \rangle^2\big] - \mathbb{E}\big[\langle \Psi(\theta), z \rangle\big]^2 = \Psi(\theta)^\top \Sigma_{n-1}\, \Psi(\theta),$$
where the expectations are with respect to $z \sim p(\cdot \mid \theta_1, y_1, \ldots, \theta_{n-1}, y_{n-1})$.
We perform a singular value decomposition $\Phi_{n-1} = U\Lambda_{n-1}V^\top$. In words, when we use the greedy method and maximize the variance of the prediction, it corresponds to taking $\Psi(\theta_n)$ in the direction of the eigenvector of $\Phi_{n-1}\Phi_{n-1}^\top$ with the smallest eigenvalue.
Since every $\Psi(\theta)$ is normalized and we initialize uniformly, $\Sigma_0 = I$, the process is equivalent to scanning the orthogonal spaces of normalized vectors in $\mathbb{R}^d$ for $\lfloor n/d \rfloor$ times. For large $n$, the diagonal entries of $\Lambda_n\Lambda_n^\top$ are approximately uniform and are all larger than or equal to $\lfloor n/d \rfloor$. Then $\lambda_{\min}\big(\Phi_n\Phi_n^\top\big) = \Omega(n/d)$. Plugging into the bound of Lemma 1, we obtain that $\|\hat{z} - z^*\|_2 = O(\sigma d/\sqrt{n})$ with high probability.

□
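To make the greedy-versus-random comparison concrete, the following sketch simulates the linear-Gaussian setting of Theorem 1 with conjugate posterior updates. The candidate-pool size, dimensions, and noise level are illustrative choices, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rounds, sigma = 20, 200, 0.1

# Linear random feature map Psi(theta) = Psi @ theta with i.i.d. normal entries.
Psi = rng.normal(size=(d, d))
z_star = rng.normal(size=d)
z_star /= np.linalg.norm(z_star)

def run(greedy: bool) -> float:
    # Conjugate Gaussian prior z ~ N(0, I); closed-form rank-one posterior updates.
    Sigma = np.eye(d)          # posterior covariance
    mu = np.zeros(d)           # posterior mean (the estimate z-hat)
    for _ in range(n_rounds):
        # Candidate pool of normalized parameters.
        cands = rng.normal(size=(64, d))
        cands /= np.linalg.norm(cands, axis=1, keepdims=True)
        feats = cands @ Psi.T                    # Psi(theta) for each candidate
        if greedy:
            # Maximize the predictive variance Psi(theta)^T Sigma Psi(theta),
            # equivalent to maximizing the KL between posterior and prior.
            i = int(np.argmax(np.einsum('ij,jk,ik->i', feats, Sigma, feats)))
        else:
            i = 0                                # random choice from the pool
        phi = feats[i]
        y = phi @ z_star + sigma * rng.normal()  # noisy linear observation
        # Kalman-style rank-one Gaussian posterior update.
        s = phi @ Sigma @ phi + sigma**2
        k = Sigma @ phi / s
        mu = mu + k * (y - phi @ mu)
        Sigma = Sigma - np.outer(k, phi @ Sigma)
    return float(np.linalg.norm(mu - z_star))

err_greedy, err_random = run(True), run(False)
print(f"greedy error: {err_greedy:.3f}, random error: {err_random:.3f}")
```

Because the random feature map is ill-conditioned, random exploration under-samples the small-singular-value directions, while the greedy rule steers queries toward them.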
Proof of Lemma 1. We first express the estimate $\hat{z}$ as follows. Using the singular value decomposition $\Phi_n = V\Lambda_n U^\top$, we have
$$\hat{z} - z^* = \big(\Phi_n\Phi_n^\top\big)^{-1}\Phi_n\, \epsilon_{1:n} = V\Lambda_n^{\dagger} U^\top \epsilon_{1:n}.$$
Assuming that the noise $\epsilon_i$ is $\sigma$-subGaussian, then so is $U^\top\epsilon_{1:n}$, since $U$ is a unitary matrix. Multiplying by the pseudo-inverse $\Lambda_n^{\dagger}$, whose nonzero entries are bounded by $\lambda_{\min}(\Phi_n\Phi_n^\top)^{-1/2}$, we have $\|\Lambda_n^{\dagger} U^\top \epsilon_{1:n}\|_2 \leq \sigma\sqrt{d}\,/\sqrt{\lambda_{\min}(\Phi_n\Phi_n^\top)}$ with high probability. Therefore,
$$\|\hat{z} - z^*\|_2 = O\!\left(\frac{\sigma\sqrt{d}}{\sqrt{\lambda_{\min}\big(\Phi_n\Phi_n^\top\big)}}\right).$$

B.1 SEIR Model
Our SEIR simulator is a simple stochastic, discrete, chain-binomial compartmental model. In this model, susceptible individuals ($S$) become exposed ($E$) through interactions with infectious individuals ($I$). Exposed individuals, which are infected but not yet infectious, transition to the infectious compartment at a rate $\varepsilon$ that is inversely proportional to the latent period of the disease. Lastly, infectious individuals transition to the removed compartment at a rate $\mu$ that is inversely proportional to the infectious period. Removed individuals ($R$) are assumed to be no longer infectious and are considered either recovered or dead. All transitions are simulated by random draws from a binomial distribution.
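One step of a chain-binomial SEIR model of this kind can be sketched as follows. The exponential form of the transition probabilities and the specific rate values are standard modeling choices assumed here, not taken from the paper.

```python
import numpy as np

def seir_step(S, E, I, R, beta, eps, mu, rng):
    """One day of a stochastic chain-binomial SEIR model.

    beta: transmission rate; eps: inverse latent period; mu: inverse
    infectious period. The 1 - exp(-rate) probabilities are one standard
    chain-binomial choice (assumed, not specified in the text).
    """
    N = S + E + I + R
    p_inf = 1.0 - np.exp(-beta * I / N)   # P(S -> E) given current prevalence
    p_lat = 1.0 - np.exp(-eps)            # P(E -> I)
    p_rec = 1.0 - np.exp(-mu)             # P(I -> R)
    new_E = rng.binomial(S, p_inf)        # all transitions are binomial draws
    new_I = rng.binomial(E, p_lat)
    new_R = rng.binomial(I, p_rec)
    return S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R

rng = np.random.default_rng(42)
state = (9990, 0, 10, 0)                  # S, E, I, R
for _ in range(100):
    state = seir_step(*state, beta=0.4, eps=1 / 3, mu=1 / 5, rng=rng)
print(state)
```

Each draw conserves the total population, so repeated steps yield a valid stochastic trajectory through the four compartments.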

B.2 LEAM-US Model
LEAM-US integrates a human mobility layer, represented as a network, using both short-range (i.e., commuting) and long-range (i.e., flight) mobility data. Commuting flows between counties are obtained from the 2011-2015 5-Year ACS Commuting Flows survey and adjusted to account for differences in population totals since the creation of the dataset. Long-range air travel flows are quantified using origin-destination daily passenger flows between airport pairs as reported by the Official Aviation Guide (OAG) and IATA databases (updated in 2021) [26,46]. In addition, flight probabilities are age and country specific. The disease dynamics are modeled using a classic SEIR-like model. The model is initialized using a multi-scale modeling approach that utilizes GLEAM, the Global Epidemic and Mobility model [3,4,9,12,59,67], calibrated to realistically represent the evolution of the COVID-19 pandemic [9,12], to simulate a set of 500 different initial conditions for LEAM-US starting on February 16th, 2020. Lastly, travel restrictions, mobility reductions, and government interventions are explicitly modeled to mimic the real timeline of the events that occurred during the COVID-19 pandemic.

B.3 Spatiotemporal NP Model
As shown in Figure 6, our model has $\theta$ at both the county and state level and $z_{1:T}$ at the state level. We use the county-level parameter $\theta$ together with a county-to-county mobility graph $G$ as input. We use the DCGRU layer [35] to encode the graph in a GRU. We use a linear layer to map the county-level output to hidden features at the state level. For both the state-level encoder and decoder, we use multi-layer GRUs.
The input $\theta_{1:T}$ is the county-level parameters for LEAM-US with a dimension of 10. The county-level embedding uses a 1-layer DCGRU with a width of 16. The internal state is at the state level with a dimension of 16. The state-level encoder and decoder use 3-layer GRUs with a width of 128. The dimension of the latent process $z_{1:T}$ is 32. The dimension of the output $x_{1:T}$ is 24, including the incidence and prevalence for 12 compartments. We trained the STNP model for 500 steps with the learning rate fixed at $10^{-3}$ using the Adam optimizer. We perform early stopping with a patience of 50 for both offline learning and Bayesian active learning.

B.4 Acquisition Function
Maximum Mean STD. Mean STD [18] is a heuristic used to estimate model uncertainty. For each augmented parameter $\theta$, we sample multiple $z_{1:T}$ and generate a set of predictions $\{\hat{x}_{1:T}\}$. For a length-$T$ sequence with dimension $D$, we compute the standard deviation $\sigma_{t,d}$ for time step $t$ and feature $d$. Mean STD computes $\bar{\sigma} = \frac{1}{TD}\sum_{t=1}^{T}\sum_{d=1}^{D}\sigma_{t,d}$ for each parameter $\theta$, and we select the $\theta$ with the maximum $\bar{\sigma}$. Empirically, we found that Mean STD often becomes over-conservative and tends to explore less.
Maximum Entropy. Maximum entropy selects the parameter with the maximum predictive entropy. In general, entropy is intractable for continuous outputs. Our NP model implicitly assumes the predictions follow a multivariate Gaussian, which allows us to compute the differential entropy [28]. We follow the same procedure as Mean STD to estimate the empirical covariance $\Sigma \in \mathbb{R}^{TD \times TD}$ and compute the differential entropy for each parameter as $H = \frac{1}{2}\ln|\Sigma| + \frac{TD}{2}(1 + \ln 2\pi)$. We select the parameter $\theta$ with the maximum entropy.
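Both baseline heuristics can be sketched from a set of sampled predictions. Here synthetic Gaussian samples stand in for the NP predictions drawn for one candidate $\theta$; the shapes and scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, n_samples = 10, 3, 200

# Stand-in for predictions {x_hat_{1:T}} obtained by sampling multiple z_{1:T}
# for one candidate theta (synthetic Gaussians with per-feature scales).
preds = rng.normal(size=(n_samples, T, D)) * np.linspace(0.5, 1.5, D)

# Mean STD: average the per-(t, d) standard deviation over the MC samples.
mean_std = preds.std(axis=0).mean()

# Maximum entropy: differential entropy of a multivariate Gaussian fitted to
# the flattened length T*D prediction vectors,
#   H = (1/2) ln|Sigma| + (T*D/2) (1 + ln 2*pi).
flat = preds.reshape(n_samples, T * D)
Sigma = np.cov(flat, rowvar=False) + 1e-6 * np.eye(T * D)  # regularized
sign, logdet = np.linalg.slogdet(Sigma)                    # stable log-det
entropy = 0.5 * logdet + 0.5 * T * D * (1.0 + np.log(2.0 * np.pi))

print(round(float(mean_std), 2), round(float(entropy), 1))
```

In an active learning loop, these two scores would be computed per candidate $\theta$ and the argmax selected.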
B.5 Implementation Details. We implement STNP to mimic the reaction-diffusion simulator with the feed rate ($\theta_0$) and kill rate ($\theta_1$) as input. The initial state of the reaction is fixed. We use multiple convolutional layers with a linear layer to encode the spatial data into the latent space. We use an LSTM layer to encode the latent spatial data with $\theta_0, \theta_1$ and map the input-output pairs to hidden features. With $(\theta_0, \theta_1)$ and $z_{1:5}$ sampled from the posterior distribution, we use an LSTM layer and deconvolutional layers to simulate the reaction-diffusion sequence. For each epoch, we randomly select 20% of the samples as the context sequence.
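The feed-rate/kill-rate parameterization matches the Gray-Scott reaction-diffusion system. A minimal numpy sketch of one such simulator (assuming Gray-Scott dynamics and standard diffusion constants, which the text does not specify) is:

```python
import numpy as np

def laplacian(Z):
    # 5-point stencil with periodic boundaries.
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

def gray_scott_step(U, V, F, k, Du=0.16, Dv=0.08, dt=1.0):
    # Gray-Scott update: u feeds in at rate F, v is removed at rate F + k,
    # and the reaction u + 2v -> 3v transfers mass via the u*v^2 term.
    UVV = U * V * V
    U = U + dt * (Du * laplacian(U) - UVV + F * (1 - U))
    V = V + dt * (Dv * laplacian(V) + UVV - (F + k) * V)
    return U, V

n = 64
U = np.ones((n, n))
V = np.zeros((n, n))
U[28:36, 28:36] = 0.50                    # seed a central perturbation
V[28:36, 28:36] = 0.25
for _ in range(1000):
    U, V = gray_scott_step(U, V, F=0.040, k=0.060)
print(U.min(), U.max(), V.max())
```

Varying $(F, k)$ over a grid is what produces the distinct spatiotemporal patterns a surrogate model has to capture.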
We use the likelihood-free inference code [14] to implement the Masked Autoregressive Flow (MAF) model and the Sequential Neural Likelihood (SNL) framework. We use the VAE-based deep surrogate model code for Deep Multi-Fidelity Active Learning (DMFAL) [34] to implement DMFAL. For DMFAL, we set the number of fidelity levels to 1 to match our task setting. Note that we only use the DMFAL model for the offline learning test, as their proposed acquisition function is EIG applied to multiple fidelity levels, which is equivalent to EIG once we reduce the number of fidelity levels to 1. We also include an RNN baseline with variational dropout, which only uses the NP decoder for surrogate modeling. We use this baseline for an ablation study to show the effectiveness of including the latent processes of the Neural Process model.
We follow the same hyperparameter settings used in DMFAL [34] for the standard Heat simulation task. For the other experiments, we tune the hyperparameters of the baseline models, including the learning rate and the hidden state size, to optimize their performance.

Figure 1 :
Figure 1: Illustration of the Interactive Neural Process (INP). Given simulation parameters and data, INP trains a surrogate model (e.g., STNP) to infer the latent process. The inferred latent process allows prediction and uncertainty quantification. These are used to calculate the acquisition function (e.g., LIG), select the next set of parameters to query, and simulate more data.

Algorithm 1:
…and use it to train our NP model and learn the latent process. During inference, given the augmented parameters $\theta$, we use the trained NP model to predict the future states $(\hat{x}_1, \cdots, \hat{x}_T)$.

Figure 3 :
Figure 3: Prediction visualizations. Left: accuracy and uncertainty quantification comparison between the Neural Process (NP) and the Gaussian Process (GP) in the SEIR simulator. Middle: STNP predictions for spatiotemporal patterns of substances in the Reaction-Diffusion simulator. Right: STNP predictions for the number of individuals in the Infectious and Removed compartments in the LEAM-US simulator.

Figure 4 :
Figure 4: MAE loss versus the percentage of samples for Bayesian active learning. The black dashed line shows the offline learning performance with the entire dataset available for training. Left: GP and NP for SEIR. Middle: STNP for RD. Right: STNP for LEAM-US.

Figure 5 :
Figure 5: Acquisition function behavior visualization in the SEIR model. For each iteration, the top row is the current MAE mesh over the infectious population for all candidate parameter pairs. The bottom row is the acquisition function score. Yellow dots are existing parameters. Red stars are the newly selected parameters.

Figure 6 :
Figure 6: Visualization of the STNP model architecture. For both the encoder and the decoder, we use a diffusion convolutional GRU (DCGRU) [36] to capture spatiotemporal dependencies.
1. Inspired by this fact, we propose a novel acquisition function computing the expected KL divergence in the latent processes and name it LIG. Specifically, the trained NP model produces a variational posterior given the current dataset $S$ as $q(z_{1:T} \mid S)$. For every parameter $\theta$ remaining in the search space, we can predict $\hat{x}_{1:T}$ with the decoder. We use $\hat{x}_{1:T}$ and $\theta$ as input to the encoder to re-evaluate the posterior $q(z_{1:T} \mid \hat{x}_{1:T}, \theta, S)$.

Figure 2: Graphical model comparison: Neural Process, Sequential Neural Process, and our Spatiotemporal Neural Process.
LIG computes the distributional difference with respect to the latent process $z_{1:T}$ as $\mathbb{E}_{p(\hat{x}_{1:T} \mid \theta)}\left[\mathrm{KL}\big(q(z_{1:T} \mid \hat{x}_{1:T}, \theta, S) \,\|\, q(z_{1:T} \mid S)\big)\right]$.

We evaluate our proposed STNP for its surrogate modeling performance in the offline learning setting and the LIG acquisition function for its active learning performance. We aim to verify that (a) LIG outperforms other acquisition functions in the NP and GP model settings for deep Bayesian active learning on non-spatiotemporal surrogate modeling, (b) STNP outperforms existing baselines for spatiotemporal surrogate modeling in the offline learning setting, and (c) LIG outperforms other acquisition functions in the STNP model setting for deep Bayesian active learning on spatiotemporal surrogate modeling. The implementation code is available at https://github.com/Rose-STL-Lab/Interactive-Neural-Process.
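The expected KL above can be estimated by Monte Carlo: sample $z_{1:T}$ from the current posterior, decode, re-encode, and average the KL. The sketch below uses toy diagonal-Gaussian stand-ins; `decoder` and `encoder` here are hypothetical placeholders, not the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(mu_q, var_q, mu_p, var_p):
    # KL between diagonal Gaussians, summed over latent dimensions.
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

latent_dim = 8
prior_mu = np.zeros(latent_dim)           # q(z | S): current posterior,
prior_var = np.ones(latent_dim)           # playing the role of the "prior"

def decoder(theta, z):
    # Toy stand-in: predicted x-hat given theta and a latent sample.
    return np.tanh(theta * z)

def encoder(theta, x_hat):
    # Toy re-evaluated posterior q(z | x_hat, theta, S): mean shifts with the
    # imagined observation, variance shrinks with theta (an assumption).
    return 0.5 * x_hat, prior_var / (1.0 + theta ** 2)

def latent_information_gain(theta, n_mc=32):
    # E_{p(x_hat | theta)} [ KL( q(z | x_hat, theta, S) || q(z | S) ) ]
    total = 0.0
    for _ in range(n_mc):
        z = prior_mu + np.sqrt(prior_var) * rng.normal(size=latent_dim)
        x_hat = decoder(theta, z)         # imagine an outcome for theta
        mu_q, var_q = encoder(theta, x_hat)
        total += kl_gauss(mu_q, var_q, prior_mu, prior_var)
    return total / n_mc

candidates = np.linspace(0.1, 2.0, 20)
scores = [latent_information_gain(t) for t in candidates]
best = candidates[int(np.argmax(scores))]
print(f"selected theta = {best:.2f}")
```

With these stand-ins, more informative parameters shrink the re-evaluated posterior more, so LIG favors them; the real method replaces the stubs with the trained STNP encoder and decoder.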

Table 1 :
Surrogate model performance comparison using MAE in the Reaction-Diffusion, Heat, and LEAM-US simulators (population divided by 1000).
The scaling comes from the fact that $\theta$ is normalized [23]. Hence for large $n$, $\lambda_{\min}\big(\Theta_n\Theta_n^\top\big) \geq \big(1 - \sqrt{d/n}\big)^2\, n$. Plugging this result into Lemma 1, we obtain that the error $\|\hat{z} - z^*\|_2$ for random exploration in the space of $\theta$ is $O(\sigma d^{3/2}/\sqrt{n})$ with high probability.

Table 2 :
Performance comparison of different acquisition functions in the NP model for the SEIR simulator.