Projected Gaussian Markov Improvement Algorithm for High-dimensional Discrete Optimization via Simulation

This paper considers a discrete optimization via simulation (DOvS) problem defined on a graph embedded in the high-dimensional integer grid. Several DOvS algorithms that model the responses at the solutions as a realization of a Gaussian Markov random field (GMRF) have been proposed exploiting its inferential power and computational benefits. However, the computational cost of inference increases exponentially in dimension. We propose the projected Gaussian Markov improvement algorithm (pGMIA), which projects the solution space onto a lower-dimensional space, creating the region-layer graph to reduce the cost of inference. Each node on the region-layer graph can be mapped to a set of solutions projected to the node; these solutions form a lower-dimensional solution-layer graph. We define the response at each region-layer node to be the average of the responses within the corresponding solution-layer graph. From this relation, we derive the region-layer GMRF to model the region-layer responses. The pGMIA alternates between the two layers to make a sampling decision at each iteration; it first selects a region-layer node based on the lower-resolution inference provided by the region-layer GMRF, then makes a sampling decision among the solutions within the solution-layer graph of the node based on the higher-resolution inference from the solution-layer GMRF. To solve even higher-dimensional problems (e.g., 100 dimensions), we also propose the pGMIA+, a multi-layer extension of the pGMIA. We show that both pGMIA and pGMIA+ converge to the optimum almost surely asymptotically and empirically demonstrate their competitiveness against state-of-the-art high-dimensional Bayesian optimization algorithms.


INTRODUCTION
Optimization via simulation (OvS) refers to the class of methodologies for optimizing a problem whose objective function and/or constraints at a feasible solution must be estimated via stochastic simulation. When an OvS problem has a discrete solution space, the problem is further categorized as a discrete OvS (DOvS) problem. This paper focuses on a DOvS problem whose feasible solution space is a high-dimensional hyperbox defined on the integer lattice. To motivate our research, we illustrate two practical applications below. (i) Cevik et al. (2016) study calibration of the University of Wisconsin breast cancer simulation model (UWBCS). The input parameters of the UWBCS include breast cancer natural history, detection, and treatment parameters, as well as non-breast cancer mortality parameters. Calibration targets include the cancer incidence and mortality rates. Each combination of parameters is assigned a score that measures the difference between the simulation results and the targets. The UWBCS has ten natural history parameters that amount to 378,000 feasible combinations in total.
(ii) Hoffman et al. (2018) study condition-based maintenance scheduling for a manufacturing line. Each machine's state is represented by an integer-valued health index, which stochastically degrades over time. The vector of decision variables consists of a health index threshold at which maintenance is to be scheduled for each machine. Maintenance resources are shared across all machines. The objective is to minimize the expected cost of operation. The problem dimension is determined by the number of machines, which may be large for a complex manufacturing system.
In both examples, each solution can only be evaluated via stochastic simulation. The stochastic error variance of the simulation output tends to be large and the simulation run-time is nontrivial. Both examples have high-dimensional discrete solution spaces that can be embedded in the integer lattice, where the number of feasible solutions increases exponentially in the problem dimension. As such, it is difficult to apply a DOvS algorithm that requires all solutions to be simulated (e.g., ranking and selection). Meanwhile, it is reasonable to assume that two solutions that are close in the integer lattice are likely to have similar responses, e.g., the expected operating costs of two maintenance policies are similar if their health index thresholds are close. Therefore, a DOvS method that exploits spatial inference may significantly improve efficiency in solving these types of problems.
There are several existing inferential DOvS algorithms in which the responses at the feasible solutions are modeled as a realization of a stochastic process. Taking the Bayesian view, a prior distribution is assumed for the stochastic process, which is then updated to the posterior as the algorithm simulates solutions. The updated posterior distribution provides inference on the responses at any feasible solutions, including those that have not yet been simulated, and the algorithm selects which solution to simulate next by evaluating a sampling criterion. A popular choice for the prior is a Gaussian process (GP) as its posterior update is convenient; combined with the assumption that the simulation output has Gaussian noise, the posterior is still a GP. Inferential OvS is closely related to Bayesian Optimization (BO), which has become popular for global optimization of a black-box function. BO also adopts the GP as a workhorse model while mainly focusing on continuous-domain problems with deterministic objective functions.
Compared to the continuous counterpart, the literature on inferential DOvS (or discrete BO) is rather slim. Baptista and Poloczek (2018) focus on binary variables, while Oh et al. (2019) and Roustant et al. (2020) consider a combinatorial solution space with categorical variables. Mes et al. (2011) also study categorical solutions and have commonality with this work in that they build hierarchical Bayesian models to represent the solution space with decreasing resolution. However, they do not model spatial correlation among solutions and their target problem is much smaller-scale (thousands of solutions) than what is considered here. Closer to our work are Sun et al. (2014) and Xie et al. (2016), which study integer solution spaces embedded in the Euclidean space and adopt a continuous GP covariance kernel to model the spatial correlation among the discrete solutions. In the same setting, Garrido-Merchán and Hernández-Lobato (2020) discuss a way to improve the covariance kernel's inference on the discrete solution space via rounding.
Meanwhile, Salemi et al. (2019) empirically demonstrate that some popular choices of the covariance kernel may result in poor inference when applied to solutions on the integer lattice. Instead, they model the responses at the solutions on the $d$-dimensional integer lattice as a realization of a Gaussian Markov random field (GMRF) and propose the Gaussian Markov improvement algorithm (GMIA). A GMRF is a GP defined on a graph of nodes, where the connectivity between nodes on the graph determines the correlation structure of the GMRF; its precision matrix (the inverse of the covariance matrix) has a nonzero off-diagonal element only if the corresponding nodes are connected by an edge on the graph. Thus, the precision matrix tends to be sparse when the graph is sparse, and updating the precision matrix is extremely cheap. However, to evaluate a sampling criterion at the solutions, their posterior variances are needed. Even if the precision matrix is sparse, computing its inverse is expensive for a large-scale DOvS problem; the cost of the matrix operations involved in sampling criterion evaluation increases at least quadratically in the number of feasible solutions.
To reduce the computational complexity of the GMIA, several improvements have been proposed. Semelhago et al. (2017) adopt sparse linear algebra techniques to significantly reduce the cost of precision matrix inversion. For further speed-up, Semelhago et al. (2021) restrict the sampling decisions to a promising subset of solutions for several rapid-search iterations. In contrast, Li and Song (2020) achieve the same order of computational cost as Semelhago et al. (2021) while making global inference by exploiting linear algebraic techniques tailored for sparse matrices. However, all of these approaches require inverting a precision matrix at least intermittently. As the problem dimension increases, the GMIA eventually hits the limit on the size of the precision matrix that can be stored and inverted.
Reducing the computational overhead of a high-dimensional BO problem has been actively researched in recent years. These approaches can be categorized into two groups as summarized below. For a more comprehensive review of high-dimensional BO, see Binois and Wycoff (2022).
The first direction is batching the dimensions into groups and constructing an additive GP model from the batches. Kandasamy et al. (2015) decompose the solution space into disjoint groups of dimensions and represent the objective function as the sum of independent GPs defined for each group. They apply the sampling criterion in each group separately, then aggregate the results to decide the next solution to sample. Wang et al. (2017) propose an adaptive decomposition of dimensions using a multinomial distribution whose posterior is updated throughout the algorithm. Rolland et al. (2018) and Mutny and Krause (2018) extend the idea to overlapping dimension groups. Wang et al. (2018) partition the solution space using a Mondrian process and learn a local GP for each subspace along with the additive structure.
The second approach is to reduce the problem dimension by projection, assuming that the black-box function has a lower-dimensional active subspace. Wang et al. (2013) first apply the projection idea to high-dimensional BO using a random projection matrix. On the other hand, Djolonga et al. (2013) apply a low-rank recovery algorithm to compute the projection matrix from an initial design. However, when the feasible solution space is bounded, a projection-based method does not perform well if a substantial chunk of the embedded subspace is projected outside the feasible solution space of the original problem. Letham et al. (2020) and Binois et al. (2020) each discuss lower-dimensional embedding approaches that avoid this phenomenon. Eriksson and Jankowiak (2021) only consider a sparse axis-aligned subspace, arguing that inferring a non-axis-aligned subspace is substantially more expensive and leads to overfitting. Lu et al. (2018) and Moriconi et al. (2020) consider nonlinear embedding through variational autoencoders (VAEs).
Both batching and projection approaches have limitations. In batching, the objective function is assumed to be a sum of functions of the batched dimensions. Therefore, unless two dimensions are in the same batch, their interaction effects cannot be modeled. Moreover, while the additive model's covariance can be constructed easily from the batch GPs, for a GMRF, spatial correlation is implicitly determined by the precision matrix and it is not straightforward to construct the precision matrix of the additive model from the batch GMRFs. For the projection methods, it is typically assumed that an upper bound on the active subspace's dimension is known, but without prior knowledge of the objective function, such an upper bound is difficult to obtain. Furthermore, Mathesen et al. (2019) point out that projection-based BO algorithms may perform poorly when a lower-dimensional active subspace does not exist for the given problem; the same can be observed in our numerical experiments.
In this paper, we propose the projected Gaussian Markov Improvement Algorithm (pGMIA) and pGMIA+. The former establishes a framework for the GMIA to scale up in the problem dimension, while the latter extends the framework to solve significantly higher-dimensional problems than the former. The pGMIA and pGMIA+ reap the benefits of both batching and projection approaches while tackling their limitations.
The pGMIA partitions the dimensions into two batches, solution-layer and region-layer dimensions, then projects the solution space onto the latter. Since this is an axis-aligned projection, the integer lattice (graph) structure of the solution space is preserved after projection. The name, region layer, comes from the fact that a node in the projected graph can be mapped to a region of solutions in the original solution space. We refer to the projected graph and its nodes as the region-layer graph and region-layer nodes, respectively. Conversely, each region-layer node corresponds to a solution-layer graph that contains all solutions projected to the node. Note that only the solution-layer dimensions are represented in the solution-layer graph, which makes it lower-dimensional than the original solution space.
There are two major differences between the pGMIA and other projection methods. First, the pGMIA does not assume an active lower-dimensional subspace. Instead, it regards the response at each region-layer node as the average of the responses at all solutions in the corresponding solution-layer graph. Second, unlike other methods that make sampling decisions only in the projected space, the pGMIA makes hierarchical sampling decisions in both region and solution layers. The pGMIA first selects a region-layer node, then zooms into its solution-layer graph to select a solution to sample. Essentially, the pGMIA treats all dimensions as active, not just the region-layer dimensions.
As in the batching approaches, we model both region-layer and solution-layer responses with GMRFs; however, we do not fit an independent model for each batch of dimensions. Instead, we derive both region- and solution-layer GMRFs from a single GMRF representing the entire solution space; we refer to the latter as the single-layer GMRF to distinguish it from the other two. One region-layer GMRF is defined to represent the region-layer graph's responses, and for each region-layer node, a solution-layer GMRF is defined for the corresponding solution-layer graph.
Unlike the batching methods, the pGMIA does not assume the objective function to be a sum of lower-dimensional functions. Rather, our model has a hierarchical structure: the region-layer GMRF captures the lower-resolution trend in the response, whereas each solution-layer GMRF provides a local, higher-resolution model. In addition, the pGMIA periodically updates the batches according to a user-chosen criterion. This allows re-batching of the dimensions and gives the pGMIA an opportunity to learn all interaction terms in the response as the simulation budget increases.
The pGMIA+ extends the two-layer scheme of the pGMIA to multiple layers; it partitions the dimensions into $m > 2$ batches and constructs layers by hierarchically projecting the batches of dimensions. Starting from the top layer, the pGMIA+ selects a node on each layer's graph and zooms into the lower-layer graph to make a sampling decision. As $m$ increases, fewer dimensions are included in each layer, reducing the computational overhead of the sampling decisions. This allows the pGMIA+ to tackle much higher-dimensional problems than the pGMIA.
The idea of hierarchical search between two layers of GMRFs has also been explored by Salemi et al. (2019) in their algorithm, the multi-resolution GMIA (MR-GMIA). However, the MR-GMIA has several theoretical and practical limitations. First, their region-layer GMRF is fitted independently from the solution-layer GMRFs for computational convenience, although there is clear dependence due to averaging of the responses. Second, it is assumed that the precision matrix of the region-layer GMRF has the same sparsity pattern as that of the solution-layer GMRF, while in truth the region-layer precision matrix becomes dense due to averaging. Moreover, the batches of dimensions as well as the hyperparameters of the GMRFs remain unchanged after initialization throughout the algorithm, which significantly slows down the search progress in later iterations, as our empirical study shows.
Unlike the MR-GMIA, the pGMIA's region- and solution-layer GMRFs are statistically consistent; the region-layer GMRF is defined by projecting the single-layer GMRF. This makes the region-layer precision matrix dense; however, we prove that the dense precision matrix can be effectively approximated with an easy-to-compute sparse matrix. We derive an exact bound on the approximation error and show that it becomes negligible as the feasible range in each dimension increases. Moreover, the resulting approximate precision matrix has the same sparsity pattern as the single-layer GMRF; this is precisely what the MR-GMIA assumes without mathematical justification. As such, efficient linear algebraic techniques tailored for the single-layer GMRF can still be applied at the region layer.
Moreover, exploiting the Markov property of the single-layer model, we show that each solution-layer GMRF is independent of the rest of the field conditional on the responses at the neighboring solutions of the region being equal to the single-layer prior means. This approximation significantly reduces the computational overhead of the solution-layer sampling decisions, as we can focus on the solutions included in the selected region-layer node while ignoring the rest of the solution space.
Another benefit of the pGMIA is that the hyperparameters of the GMRFs can be updated cheaply. Under our region-layer GMRF approximation, the region-layer hyperparameters can be written as linear functions of the single-layer GMRF parameters. Thus, instead of computing the maximum likelihood estimator (MLE) of the single-layer GMRF, the pGMIA first solves the region-layer MLE problem, then solves the solution-layer problem conditional on the region-layer estimates. Since each layer contains fewer dimensions than the original space, the computational overhead is significantly reduced. Thanks to this feature, the pGMIA can quickly update the hyperparameters whenever the batch of dimensions is updated.
Under mild assumptions, we prove that both pGMIA and pGMIA+ converge to a global optimum of the DOvS problem almost surely when run without stopping. We compare our algorithms with the MR-GMIA as well as some state-of-the-art high-dimensional BO algorithms. The empirical results demonstrate the competitiveness of the pGMIA and the pGMIA+ in both search progress and computation time.
The remainder of the paper is structured as follows. In Section 2, we provide background on GMRFs and discuss the projected GMRF model. Section 3 discusses the hyperparameter estimation at both region and solution layers. Section 4 provides the algorithmic details of the pGMIA and the pGMIA+. An extensive empirical study is presented in Section 5 to demonstrate the performance of the pGMIA and the pGMIA+, followed by conclusions in Section 6. Due to the space limit, all appendices are included in the Supplementary Material.

GAUSSIAN MARKOV RANDOM FIELDS FOR DOVS
The DOvS problem of our interest is $\min_{\mathbf{x}\in\mathcal{X}} y(\mathbf{x}) \triangleq \mathrm{E}[Y(\mathbf{x})]$, where the feasible solution space, $\mathcal{X}$, is a finite subset of the $d$-dimensional integer lattice, and $Y(\mathbf{x})$ is the stochastic simulation output at feasible solution $\mathbf{x}$. For any $\mathbf{x} \in \mathcal{X}$, $Y_j(\mathbf{x}) = y(\mathbf{x}) + \epsilon_j(\mathbf{x})$ can be observed for replications $j = 1, 2, \dots$, where the simulation error $\epsilon_j(\mathbf{x})$ is independent and identically distributed (i.i.d.) with zero mean and unknown variance $\sigma_\epsilon^2(\mathbf{x})$.
We further assume $\mathcal{X}$ to be a hyperbox. The number of feasible solutions, $n \triangleq |\mathcal{X}|$, increases exponentially in $d$ in this setting. When $n$ is large and the simulation runtime is nonnegligible, only a small fraction of $\mathcal{X}$ can be simulated. To make inference at the solutions not simulated yet, we model $y(\cdot)$ as a realization of a GMRF. In Section 2.1, we introduce the single-layer GMRF, then define the region- and solution-layer GMRFs in Section 2.2.

Single-layer GMRF
A GMRF is a multivariate Gaussian random vector $\mathbf{Y} = (Y_1, Y_2, \dots, Y_n)^\top$ defined on an undirected labeled graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges connecting neighboring nodes. In the context of DOvS, $\mathcal{V}$ corresponds to $\mathcal{X}$, and each $Y_i$ models the value of $y(\cdot)$ at the $i$th feasible solution. We refer to $\mathbf{Y}$ as the single-layer GMRF model. We also adopt $Y(\mathbf{x})$ to denote the element of $\mathbf{Y}$ corresponding to solution $\mathbf{x}$ to make the dependence explicit.
The prior imposed on $\mathbf{Y}$ is $N(\boldsymbol{\mu}, \mathbf{Q}^{-1})$, where $\boldsymbol{\mu}$ and $\mathbf{Q}$ are the mean vector and precision matrix, respectively; the covariance matrix of $\mathbf{Y}$ is $\mathbf{Q}^{-1}$. For $A \subset \mathcal{V}$, let $\mathbf{Y}_A$ be the subvector of $\mathbf{Y}$ corresponding to the nodes in $A$. Let $N(i)$ be the set of the neighbors of node $i$ in $G$. Then, for every $i \in \mathcal{V}$, we have $Y_i \perp \mathbf{Y}_{\mathcal{V}\setminus(N(i)\cup\{i\})} \mid \mathbf{Y}_{N(i)}$ (Rue and Held 2005). Namely, conditional on the responses at its neighbors, the response at each node is independent of the rest of the field, which makes $\mathbf{Y}$ Markovian. In addition, the conditional correlation between two responses given the rest of the field is

$$\mathrm{Corr}(Y_i, Y_j \mid \mathbf{Y}_{\mathcal{V}\setminus\{i,j\}}) = -Q_{ij}/\sqrt{Q_{ii}Q_{jj}}. \qquad (1)$$

Combining the Markov property with (1), we have $Q_{ij} = 0$ if $j \notin N(i)$. Thus, the sparsity of $\mathbf{Q}$ is determined by $N(\cdot)$. As in Salemi et al. (2019), we define the set of neighbors of $\mathbf{x} \in \mathcal{X}$ as $N(\mathbf{x}) = \{\mathbf{x}' \in \mathcal{X} : \|\mathbf{x} - \mathbf{x}'\|_2 = 1\}$, where $\|\cdot\|_2$ is the Euclidean norm. This makes $\mathbf{Q}$ very sparse with a fill-in rate (the fraction of nonzero elements) less than $(2d+1)/n$.
We assume that $\boldsymbol{\mu} = \mu\mathbf{1}_n$, where $\mu$ is a hyperparameter and $\mathbf{1}_n$ is the $n$-dimensional column vector of ones, as there is no prior information about the objective function values at the solutions. The precision matrix, $\mathbf{Q}$, is parameterized by $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_d)$ such that

$$Q_{\mathbf{x},\mathbf{x}'} = \begin{cases} \theta_0, & \text{if } \mathbf{x} = \mathbf{x}',\\ -\theta_0\theta_\ell, & \text{if } |\mathbf{x} - \mathbf{x}'|_{\mathrm{abs}} = \mathbf{e}_\ell,\\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$

where $\mathbf{e}_\ell$ is the $\ell$th standard basis vector and $|\cdot|_{\mathrm{abs}}$ is the element-wise absolute value operator. From (1) and (2), $\theta_\ell$ determines the prior conditional correlation between neighboring solutions in the $\ell$th dimension. The conditional precision of any solution must be positive, so $\theta_0 = Q_{\mathbf{x},\mathbf{x}} > 0$. We assume that the prior conditional correlation between two solutions is nonnegative, i.e., $\theta_\ell \geq 0$ for all $1 \leq \ell \leq d$. Moreover, we impose that $\mathbf{Q}$ is diagonally dominant, i.e., $\sum_{\ell=1}^{d} \theta_\ell < 0.5$, to make $\mathbf{Q}$ positive definite.
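To make (2) concrete, the following is a minimal sketch (ours, not the authors' implementation) that assembles the sparse precision matrix for a small lattice; the function name `build_precision` and the enumeration order, chosen to match the sorting convention described in Section 2.2, are our own.

```python
# Sketch: build the sparse Q in (2) for a small integer lattice (assumes scipy).
import numpy as np
import scipy.sparse as sp

def build_precision(n_per_dim, theta0, theta):
    """Q[x,x] = theta0; Q[x,x'] = -theta0*theta[l] if x, x' differ by 1 in dim l."""
    assert theta0 > 0 and all(t >= 0 for t in theta) and sum(theta) < 0.5
    grids = np.meshgrid(*[np.arange(m) for m in n_per_dim], indexing="ij")
    # Flatten with the first dimension varying fastest.
    X = np.stack([g.ravel(order="F") for g in grids], axis=1)
    index = {tuple(x): i for i, x in enumerate(X)}
    rows, cols, vals = [], [], []
    for i, x in enumerate(X):
        rows.append(i); cols.append(i); vals.append(theta0)
        for l in range(len(n_per_dim)):
            for step in (-1, 1):
                xn = x.copy(); xn[l] += step
                j = index.get(tuple(xn))
                if j is not None:  # xn is a feasible neighbor
                    rows.append(i); cols.append(j); vals.append(-theta0 * theta[l])
    n = X.shape[0]
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n)), X

Q, X = build_precision([4, 4], theta0=1.0, theta=[0.2, 0.2])  # Figure 1 scale
```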
Since $y(\mathbf{x})$ cannot be evaluated directly, we do not observe the realization of $Y(\mathbf{x})$. Instead, we define a new GMRF that models the stochastic simulation output at $\mathbf{x}$. Let $\mathcal{X}_2 \subset \mathcal{X}$ be the set of simulated solutions and $\mathcal{X}_1 \triangleq \mathcal{X}\setminus\mathcal{X}_2$. Let $\mathbf{Y}_1$ and $\mathbf{Y}_2$ be the subvectors of $\mathbf{Y}$ corresponding to $\mathcal{X}_1$ and $\mathcal{X}_2$, respectively. For $\mathbf{x} \in \mathcal{X}_2$, we observe the sample mean $\bar{y}(\mathbf{x}) = \frac{1}{r(\mathbf{x})}\sum_{j=1}^{r(\mathbf{x})} Y_j(\mathbf{x})$, where $r(\mathbf{x})$ is the number of replications made at $\mathbf{x}$. Let $\bar{\mathbf{Y}}_2$ be the vector of sample means at the solutions in $\mathcal{X}_2$, and let $\mathbf{Q}_\epsilon$ be the diagonal intrinsic precision matrix whose diagonal element corresponding to $\mathbf{x}$ is $r(\mathbf{x})/\hat{\sigma}_\epsilon^2(\mathbf{x})$, where the sample variance $\hat{\sigma}_\epsilon^2(\mathbf{x})$ serves as a plug-in estimate of $\sigma_\epsilon^2(\mathbf{x})$. Salemi et al. (2019) show that

$$\mathbf{Y} \mid \bar{\mathbf{Y}}_2 = \bar{\mathbf{y}}_2 \sim N\left(\boldsymbol{\mu} + \bar{\mathbf{Q}}^{-1}\begin{pmatrix}\mathbf{0}\\ \mathbf{Q}_\epsilon(\bar{\mathbf{y}}_2 - \boldsymbol{\mu}_2)\end{pmatrix},\ \bar{\mathbf{Q}}^{-1}\right), \quad \text{where } \bar{\mathbf{Q}} \triangleq \begin{pmatrix}\mathbf{Q}_{11} & \mathbf{Q}_{12}\\ \mathbf{Q}_{21} & \mathbf{Q}_{22} + \mathbf{Q}_\epsilon\end{pmatrix}, \qquad (3)$$

with $\mathbf{Q}_{ij}$ representing the submatrix of $\mathbf{Q}$ corresponding to $\mathcal{X}_i$ and $\mathcal{X}_j$ for $1 \leq i, j \leq 2$. Updating $\bar{\mathbf{Q}}$ after simulating a new solution is extremely cheap as we only need to update the corresponding diagonal element of $\mathbf{Q}_\epsilon$. Moreover, the sparsity pattern of $\mathbf{Q}$ is preserved in $\bar{\mathbf{Q}}$.
We adopt the complete expected improvement (CEI) proposed by Salemi et al. (2019) as the sampling criterion. Let $\tilde{\mathbf{x}}$ be the sample-best solution, i.e., $\tilde{\mathbf{x}} = \arg\min_{\mathbf{x}\in\mathcal{X}_2} \bar{y}(\mathbf{x})$; at any point in the algorithm, we refer to $\tilde{\mathbf{x}}$ as the current best solution. Then, the CEI of $\mathbf{x}$ relative to $\tilde{\mathbf{x}}$ is defined as $\mathrm{CEI}(\tilde{\mathbf{x}}, \mathbf{x}) \triangleq \mathrm{E}[\max(Y(\tilde{\mathbf{x}}) - Y(\mathbf{x}), 0)]$, where the expectation is with respect to the posterior distribution in (3). Let $M(\mathbf{x})$ and $V(\mathbf{x})$ denote the posterior mean and variance of $Y(\mathbf{x})$, respectively, and $C(\tilde{\mathbf{x}}, \mathbf{x})$ be the posterior covariance between $Y(\mathbf{x})$ and $Y(\tilde{\mathbf{x}})$. Note that $M(\mathbf{x})$ is an element of the mean vector of (3), and $V(\mathbf{x})$ and $C(\tilde{\mathbf{x}}, \mathbf{x})$ can be obtained from the posterior covariance matrix, $\bar{\mathbf{Q}}^{-1}$. Then, the posterior variance of $Y(\tilde{\mathbf{x}}) - Y(\mathbf{x})$ is $V(\tilde{\mathbf{x}}, \mathbf{x}) \triangleq V(\tilde{\mathbf{x}}) + V(\mathbf{x}) - 2C(\tilde{\mathbf{x}}, \mathbf{x})$, and the CEI of $\mathbf{x}$ can be computed as (Jones et al. 1998)

$$\mathrm{CEI}(\tilde{\mathbf{x}}, \mathbf{x}) = (M(\tilde{\mathbf{x}}) - M(\mathbf{x}))\,\Phi\!\left(\frac{M(\tilde{\mathbf{x}}) - M(\mathbf{x})}{\sqrt{V(\tilde{\mathbf{x}}, \mathbf{x})}}\right) + \sqrt{V(\tilde{\mathbf{x}}, \mathbf{x})}\,\phi\!\left(\frac{M(\tilde{\mathbf{x}}) - M(\mathbf{x})}{\sqrt{V(\tilde{\mathbf{x}}, \mathbf{x})}}\right), \qquad (4)$$

where $\phi$ and $\Phi$ are the standard normal probability density and cumulative distribution functions, respectively. Therefore, to compute the CEI at all solutions, we need the posterior mean vector and $2n - 1$ elements of $\bar{\mathbf{Q}}^{-1}$; namely, its diagonal and the column corresponding to $\tilde{\mathbf{x}}$. This observation combined with the sparsity of $\bar{\mathbf{Q}}$ has led to a series of computational improvements of the GMIA (Semelhago et al. 2017, 2021, Li and Song 2020) as reviewed in Section 1. However, they eventually hit computational limits as $n$ and $d$ increase. The memory space required to store the precision matrix is another challenge. For instance, Matlab requires more than 4.46e+12 GB of RAM to store $\bar{\mathbf{Q}}$ with $n = 10^{20}$.

Figure 1: Illustration of the projection scheme on a two-dimensional solution space (first dimension on the horizontal axis, second dimension on the vertical axis).
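As an illustration, a vectorized evaluation of (4) might look as follows (a sketch under our naming, assuming the posterior quantities have already been extracted from $\bar{\mathbf{Q}}^{-1}$):

```python
# Sketch: evaluate the CEI in (4) at all solutions, given the posterior means M,
# posterior variances V, and the covariance column C w.r.t. the current best.
import numpy as np
from scipy.stats import norm

def cei(M, V, C, idx_best):
    diff = M[idx_best] - M               # M(x_tilde) - M(x)
    var = V[idx_best] + V - 2.0 * C      # posterior variance of Y(x_tilde) - Y(x)
    var = np.maximum(var, 1e-12)         # guard against round-off at x_tilde
    s = np.sqrt(var)
    z = diff / s
    out = diff * norm.cdf(z) + s * norm.pdf(z)
    out[idx_best] = 0.0                  # CEI of the current best w.r.t. itself
    return out
```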

Projected GMRF
As mentioned in Section 1, the pGMIA batches the dimensions of $\mathcal{X}$ into two groups, the region-layer and the solution-layer dimensions, then projects the solution space onto the region-layer dimensions to create the region-layer graph. We illustrate the projection scheme with a simple two-dimensional example in Figure 1, where the first dimension has coordinate values from 1 to 4 and the second from 5 to 8. Suppose we choose the second dimension to be the region-layer dimension. By projecting the solution space onto the second dimension, we create a region-layer graph with four nodes (solid circles), where each node corresponds to the four solutions in each dashed box. Notice that all solutions in each dashed box have the same second coordinate since they are projected to the same node in the region-layer graph. The corresponding solution-layer graph defined for each dashed box is therefore one-dimensional with only the first dimension active. In total, the projection creates one region-layer graph and four solution-layer graphs.
The pGMIA defines the response at each region-layer node as the average of the responses at the solutions in the solution-layer graph corresponding to the node. Just as the single-layer GMRF models the responses at all solutions, the pGMIA defines a region-layer GMRF to model the vector of region-layer responses, and a solution-layer GMRF to model the responses of the solutions included in each solution-layer graph. For the example in Figure 1, one region-layer GMRF and four solution-layer GMRFs can be defined. In the remainder of this section, we mathematically formalize how the region- and solution-layer GMRFs are derived from the single-layer GMRF.
Let $n_j$ denote the number of feasible values that the $j$th coordinate of $\mathbf{x}$ can take. Consequently, we have $n = \prod_{j=1}^{d} n_j$. Let $d_2$ represent the number of region-layer dimensions and $d_1 \triangleq d - d_2$. Without loss of generality, we label the dimensions in the region layer as $d_1+1, d_1+2, \dots, d$. Let $\mathbf{R}_1, \mathbf{R}_2, \dots, \mathbf{R}_k$ denote the region-layer nodes, where $k \triangleq \prod_{j=d_1+1}^{d} n_j$; then $\mathcal{X} = \bigcup_{\ell=1}^{k} \mathcal{X}(\mathbf{R}_\ell)$, where $\mathcal{X}(\mathbf{R}_\ell)$ is the set of solutions in the solution-layer graph of $\mathbf{R}_\ell$.
The pGMIA allows the projection to be updated throughout the algorithm. Let $\mathcal{P}$ denote the partition of dimensions that determines the projection. Given $d_1$, there are $\binom{d}{d_1}$ candidates for $\mathcal{P}$. Each time $\mathcal{P}$ is updated, the dimensions are relabeled so that the first $d_1$ dimensions always belong in the solution layer. Other quantities that depend on $\mathcal{P}$, such as $k$ and the labels of the $n_j$, are also updated accordingly. The pGMIA updates $\mathcal{P}$ with mild computational overhead. In the remainder of this section and Section 3, we assume $\mathcal{P}$ to be fixed and defer the discussion on its update to Section 4.
For each region-layer node $\mathbf{R}_\ell$, its response is defined as $Z(\mathbf{R}_\ell) \triangleq \bar{n}^{-1}\sum_{\mathbf{x}\in\mathcal{X}(\mathbf{R}_\ell)} Y(\mathbf{x})$, where $\bar{n} \triangleq \prod_{j=1}^{d_1} n_j = n/k$ is the number of solutions in each solution-layer graph. To derive the distribution of $\mathbf{Z} \triangleq (Z(\mathbf{R}_1), Z(\mathbf{R}_2), \dots, Z(\mathbf{R}_k))^\top$ from that of the single-layer GMRF, $\mathbf{Y}$, let us first define the $k \times n$ matrix

$$\mathbf{P} \triangleq \mathbf{I}_k \otimes \mathbf{1}_{\bar{n}}^\top, \qquad (5)$$

where $\mathbf{I}_k$ is the $k \times k$ identity matrix and $\otimes$ represents the Kronecker product. Then, we have $\mathbf{Z} = \bar{n}^{-1}\mathbf{P}\mathbf{Y}$. We assume that the elements of $\mathbf{Y}$ are sorted such that the corresponding solutions are in ascending order of coordinates from the first to last dimensions. For instance, in Figure 1, the first five elements of $\mathbf{Y}$ correspond to Solutions (1, 5), (2, 5), (3, 5), (4, 5), and (1, 6). Each time $\mathcal{P}$ is updated and the dimensions are relabeled, $\mathbf{Y}$ is sorted accordingly.
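For illustration, the projection in (5) is one line with a Kronecker product (a sketch with our variable names, using the Figure 1 dimensions):

```python
# Sketch: P = I_k (kron) 1_nbar^T and Z = P Y / nbar for the Figure 1 example.
import numpy as np
import scipy.sparse as sp

k, nbar = 4, 4                                # 4 region nodes, 4 solutions each
P = sp.kron(sp.identity(k, format="csr"),
            np.ones((1, nbar)), format="csr") # shape (4, 16)
Y = np.arange(16.0)                           # stand-in for the latent responses
Z = P @ Y / nbar                              # region-layer responses (averages)
```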
The prior mean vector and covariance matrix for the region layer are $\mathrm{E}[\mathbf{Z}] = \mu\mathbf{1}_k$ and $\mathrm{Var}(\mathbf{Z}) = \bar{n}^{-2}\mathbf{P}\mathbf{Q}^{-1}\mathbf{P}^\top$, respectively. Although $\mathbf{Z}$ is a GMRF, it does not have the same sparse neighborhood structure as the single-layer GMRF. The inverse of $\mathrm{Var}(\mathbf{Z})$ turns out to be a dense matrix, implying that all region-layer nodes neighbor each other in the region-layer graph. To compute the exact region-layer precision matrix, we need $\mathbf{Q}^{-1}$; however, for the scale of problems of our interest, inverting $\mathbf{Q}$ is computationally infeasible. Moreover, the computational benefit of a GMRF is lost with a dense precision matrix. To tackle this challenge, we propose approximating the precision matrix of $\mathbf{Z}$ with an easy-to-compute sparse matrix as shown in Theorem 2.1.
Theorem 2.1. The approximation error $|\mathrm{Var}(\mathbf{Z})\,\mathbf{P}\mathbf{Q}\mathbf{P}^\top - \mathbf{I}_k|_{\mathrm{abs}}$ is bounded element-wise, and the bound is monotonically decreasing in each $n_j$ and converges to 0 as the $n_j$ grow.

The proof of Theorem 2.1, together with the exact expression of the bound, is included in Appendix B of the Supplementary Material. Theorem 2.1 provides an element-wise absolute value bound on the difference between $\mathbf{I}_k$ and the matrix product of $\mathrm{Var}(\mathbf{Z})$ and $\mathbf{P}\mathbf{Q}\mathbf{P}^\top$, and thereby quantifies the error in approximating the precision matrix of $\mathbf{Z}$ with $\mathbf{P}\mathbf{Q}\mathbf{P}^\top$. From (5), $\mathbf{P}\mathbf{Q}\mathbf{P}^\top$ can be computed easily. The diagonal dominance constraint on $\boldsymbol{\theta}$ guarantees that the quantities appearing in the bound are monotonically decreasing in each $n_j$, so the error bound decreases in $n_j$; moreover, it converges to 0 as $n_j \to \infty$ for all $j$. We also empirically observed that $\mathbf{P}\mathbf{Q}\mathbf{P}^\top$ approximates $(\mathrm{Var}(\mathbf{Z}))^{-1}$ well even for moderate $n_j$. From Theorem 2.1, the prior distribution of $\mathbf{Z}$ can be approximated with $N(\mu\mathbf{1}_k, \mathbf{T}^{-1})$, where $\mathbf{T} \triangleq \mathbf{P}\mathbf{Q}\mathbf{P}^\top$. Because $\mathbf{P}\mathbf{Q}\mathbf{P}^\top$ has the same sparsity pattern as $\mathbf{Q}$, all sparse linear algebraic operations tailored for the single-layer GMRF are applicable to the region-layer GMRF. Henceforth, we refer to the approximate model as the region-layer GMRF.
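The quality of this approximation is easy to check numerically on a small instance. The following sanity check (ours, not from the paper) builds a dense $\mathbf{Q}$ on a 4 x 4 lattice and compares $\mathbf{P}\mathbf{Q}\mathbf{P}^\top$ against $(\mathrm{Var}(\mathbf{Z}))^{-1}$:

```python
# Sketch: on a 4x4 lattice, T = P Q P^T approximately inverts
# Var(Z) = P Q^{-1} P^T / nbar^2, consistent with Theorem 2.1.
import numpy as np

n1, n2, th0, th = 4, 4, 1.0, (0.2, 0.2)
n, k, nbar = n1 * n2, n2, n1                 # project onto the second dimension
coords = [(i, j) for j in range(n2) for i in range(n1)]  # first dim fastest
Q = np.zeros((n, n))
for a, (i, j) in enumerate(coords):
    Q[a, a] = th0
    for b, (p, q) in enumerate(coords):
        if abs(i - p) + abs(j - q) == 1:     # lattice neighbors
            Q[a, b] = -th0 * th[0 if j == q else 1]
P = np.kron(np.eye(k), np.ones((1, nbar)))
VarZ = P @ np.linalg.inv(Q) @ P.T / nbar**2
err = np.abs(VarZ @ (P @ Q @ P.T) - np.eye(k)).max()
print(err)  # small element-wise error, shrinking as n1, n2 grow
```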
Similar to the single-layer GMRF, we derive the posterior distribution of the region-layer GMRF conditional on the simulation history. Any region-layer node is said to be simulated if at least one of the solutions in its solution-layer graph is simulated. Let $\mathcal{R}_2 \subset \mathcal{R}$ be the set of simulated region-layer nodes and $\mathcal{R}_1 \triangleq \mathcal{R}\setminus\mathcal{R}_2$, and partition $\mathbf{Z}$ into $(\mathbf{Z}_1, \mathbf{Z}_2)^\top$, where $\mathbf{Z}_1$ and $\mathbf{Z}_2$ correspond to $\mathcal{R}_1$ and $\mathcal{R}_2$, respectively. Also, let $\mathcal{X}_2(\mathbf{R}_\ell) \subseteq \mathcal{X}(\mathbf{R}_\ell)$ be the set of simulated solutions in the solution-layer graph of $\mathbf{R}_\ell$. In general, $Z(\mathbf{R}_\ell)$ is not directly observable because (i) not all $\mathbf{x} \in \mathcal{X}(\mathbf{R}_\ell)$ are simulated; and (ii) even if all $\mathbf{x} \in \mathcal{X}(\mathbf{R}_\ell)$ are simulated, we do not observe $y(\mathbf{x})$ without stochastic noise. Instead, we define the stochastic observation of $Z(\mathbf{R}_\ell)$ as

$$\bar{Z}(\mathbf{R}_\ell) \triangleq \frac{1}{|\mathcal{X}_2(\mathbf{R}_\ell)|}\sum_{\mathbf{x}\in\mathcal{X}_2(\mathbf{R}_\ell)} \bar{y}(\mathbf{x}). \qquad (6)$$

Namely, $\bar{Z}(\mathbf{R}_\ell)$ is the average of the stochastic observations at the simulated solutions in the solution-layer graph of $\mathbf{R}_\ell$. Let $\bar{\mathbf{Z}}_2$ be the vector of $\bar{Z}(\mathbf{R}_\ell)$ at all $\mathbf{R}_\ell \in \mathcal{R}_2$. To model $\bar{\mathbf{Z}}_2$, we define

$$\bar{\mathbf{Z}}_2 \mid \mathbf{Z}_2 \sim N(\mathbf{Z}_2, \mathbf{T}_\epsilon^{-1}), \qquad (7)$$

where $\mathbf{T}_\epsilon$ is the precision matrix of the stochastic noise for the region-layer GMRF. Since all solutions are simulated independently, $\mathbf{T}_\epsilon$ is diagonal. In (7), we treat $\mathcal{X}_2(\mathbf{R}_\ell)$ as a set of uniformly randomly selected solutions without replacement from $\mathcal{X}(\mathbf{R}_\ell)$. This allows $\mathrm{Var}(\bar{Z}(\mathbf{R}_\ell))$ to be estimated by a finite-population variance estimator, (8), that combines a finite-population correction term accounting for the unsimulated solutions with the average simulation error variance at the simulated solutions (Salemi et al. 2019, Cochran 1977); when all solutions in $\mathbf{R}_\ell$ are simulated, the first term of (8) is 0. We adopt the reciprocal of this estimate, $1/\widehat{\mathrm{Var}}(\bar{Z}(\mathbf{R}_\ell))$, as a plug-in estimate of the diagonal element of $\mathbf{T}_\epsilon$ that corresponds to $\mathbf{R}_\ell$. Given $\bar{\mathbf{Z}}_2 = \bar{\mathbf{z}}_2$, the posterior distribution of $\mathbf{Z}$ is, analogously to (3),

$$\mathbf{Z} \mid \bar{\mathbf{Z}}_2 = \bar{\mathbf{z}}_2 \sim N\left(\mu\mathbf{1}_k + \bar{\mathbf{T}}^{-1}\begin{pmatrix}\mathbf{0}\\ \mathbf{T}_\epsilon(\bar{\mathbf{z}}_2 - \mu\mathbf{1})\end{pmatrix},\ \bar{\mathbf{T}}^{-1}\right), \quad \text{where } \bar{\mathbf{T}} \triangleq \begin{pmatrix}\mathbf{T}_{11} & \mathbf{T}_{12}\\ \mathbf{T}_{21} & \mathbf{T}_{22} + \mathbf{T}_\epsilon\end{pmatrix}, \qquad (9)$$

and $\mathbf{T}_{ij}$ represents the submatrix of $\mathbf{T}$ corresponding to $\mathcal{R}_i$ and $\mathcal{R}_j$ for $1 \leq i, j \leq 2$. The posterior distribution of $\mathbf{Z}$ in (9) provides low-resolution inference to find a region-layer node to sample. This fixes the $d_2$ region-layer coordinates of the solution to sample. To determine its remaining coordinates, we rely on high-resolution solution-layer inference. Lemma 2.2 below lets us update the posterior distribution of the solution-layer GMRF; see Appendix C for its proof.
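A hedged sketch of an estimator consistent with the description of (8) follows; the first term is the finite-population correction, which vanishes when every solution in the region is simulated, and the second accounts for the simulation noise at the simulated solutions.

```python
# Sketch (one plausible form of (8), not a verbatim transcription):
# finite-population correction plus averaged simulation-noise variance.
import numpy as np

def var_region_obs(ybar, sig2_hat, reps, nbar):
    """ybar, sig2_hat, reps: arrays over the simulated solutions in X_2(R)."""
    m = len(ybar)
    if m < 2:
        return np.inf                 # needs >= 2 solutions (cf. Section 4.1)
    fpc = (1.0 - m / nbar) * np.var(ybar, ddof=1) / m
    noise = np.sum(sig2_hat / reps) / m**2
    return fpc + noise
```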
Lemma 2.2. Let $\mathbf{R}^*$ be a region-layer node and let $\mathbf{Y}^*$ be the subvector of $\mathbf{Y}$ corresponding to $\mathcal{X}(\mathbf{R}^*)$. Conditional on the responses at the solutions neighboring $\mathcal{X}(\mathbf{R}^*)$ being equal to their prior means, $\mathbf{Y}^* \sim N(\mu\mathbf{1}_{\bar{n}}, (\mathbf{Q}^*)^{-1})$, where $\mathbf{Q}^*$ is the submatrix of $\mathbf{Q}$ corresponding to $\mathbf{Y}^*$.

Thus, conditional on the responses at the neighboring solutions of $\mathbf{R}^*$ being equal to their prior means, $\mathbf{Y}^*$ is independent of $\mathbf{Y}\setminus\mathbf{Y}^*$. Known as the mean-field approximation in general, this approximation is adopted in various statistical models with a complex spatial correlation structure to ease the computation involved in the analysis. For instance, Hensman et al. (2015) apply the mean-field approximation for variational inference of a GP.
Based on Lemma 2.2, Proposition 2.3 derives the posterior distribution, (10), of the solution-layer GMRF of $\mathbf{R}^*$ conditional on the simulation outputs observed within $\mathcal{X}(\mathbf{R}^*)$; it takes the same form as (3) with $\mathbf{Q}$ replaced by $\mathbf{Q}^*$.

See Appendix D for the proof. We exploit Proposition 2.3 in two ways to improve the efficiency of the pGMIA. First, it lets us estimate the hyperparameters of the GMRF efficiently by separating the region- and solution-layer estimation problems as we describe in Section 3. Second, Proposition 2.3 enables us to update the solution-layer GMRF's posterior distribution locally for $\mathbf{R}^*$ without updating the posterior distribution of the entire single-layer GMRF. Recall that updating the single-layer model is computationally infeasible for large $\mathcal{X}$.
The pGMIA makes the sampling decisions utilizing the posterior distributions of the two-layer models. The CEI is applied in the region layer to select the next node to "sample". Recall that sampling a region-layer node means that a solution projected to the node will be selected for sampling. The CEI of $\mathbf{R}$ relative to $\tilde{\mathbf{R}}$ is defined as

$$\mathrm{CEI}(\tilde{\mathbf{R}}, \mathbf{R}) \triangleq \mathrm{E}[\max(Z(\tilde{\mathbf{R}}) - Z(\mathbf{R}), 0)], \qquad (11)$$

where $\tilde{\mathbf{R}} \in \mathcal{R}_2$ is the node with the smallest observed region-layer output, $M(\mathbf{R})$ and $V(\mathbf{R})$ denote the posterior mean and variance of $Z(\mathbf{R})$, respectively, $C(\tilde{\mathbf{R}}, \mathbf{R})$ is the posterior covariance between $Z(\mathbf{R})$ and $Z(\tilde{\mathbf{R}})$, and the expectation is with respect to the posterior distribution in (9); (11) is computed as in (4) with these region-layer quantities. Note that $M(\mathbf{R})$ and $V(\mathbf{R})$ are the elements of the mean vector and the diagonal of $\bar{\mathbf{T}}^{-1}$ in (9), respectively, corresponding to $\mathbf{R}$, and $C(\tilde{\mathbf{R}}, \mathbf{R})$ is an element of the column of $\bar{\mathbf{T}}^{-1}$ corresponding to $\tilde{\mathbf{R}}$. Once the CEI-maximizing region-layer node $\mathbf{R}_\ell$ is selected, we compute the following conditional CEI at the solutions in $\mathcal{X}(\mathbf{R}_\ell)$ for solution-layer sampling:

$$\mathrm{CEI}(\tilde{\mathbf{x}}(\mathbf{R}_\ell), \mathbf{x}; \mathbf{R}_\ell) \triangleq \mathrm{E}[\max(Y(\tilde{\mathbf{x}}(\mathbf{R}_\ell)) - Y(\mathbf{x}), 0)], \qquad (12)$$

where $\tilde{\mathbf{x}}(\mathbf{R}_\ell)$ is the solution in $\mathcal{X}_2(\mathbf{R}_\ell)$ with the smallest sample mean. The expectation in (12) can be computed as in (4) from the conditional distribution in (10) by setting $\mathbf{R}^* = \mathbf{R}_\ell$. We select $\mathbf{x}^{\mathrm{CEI}}(\mathbf{R}_\ell) = \arg\max_{\mathbf{x}\in\mathcal{X}(\mathbf{R}_\ell)} \mathrm{CEI}(\tilde{\mathbf{x}}(\mathbf{R}_\ell), \mathbf{x}; \mathbf{R}_\ell)$ as well as $\tilde{\mathbf{x}}(\mathbf{R}_\ell)$ for sampling.
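Putting (11) and (12) together, one pGMIA sampling decision can be rendered schematically as below (our pseudocode-style sketch reusing the `cei` helper above; `sol_posteriors` and `sample_best_in` are hypothetical callbacks supplying the layer-specific posterior quantities and the within-region sample best):

```python
# Sketch: region-layer CEI via (11)/(9), then conditional CEI via (12)/(10).
import numpy as np

def one_sampling_decision(M_reg, V_reg, C_reg, idx_R_tilde,
                          sol_posteriors, sample_best_in):
    # Region layer: node maximizing CEI relative to the best observed node.
    r_sel = int(np.argmax(cei(M_reg, V_reg, C_reg, idx_R_tilde)))
    # Solution layer: conditional posterior within the selected region.
    M_sol, V_sol, C_sol = sol_posteriors(r_sel)
    idx_x_tilde = sample_best_in(r_sel)
    x_cei = int(np.argmax(cei(M_sol, V_sol, C_sol, idx_x_tilde)))
    return r_sel, idx_x_tilde, x_cei      # both solutions are simulated
```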

HYPERPARAMETER ESTIMATION FOR PROJECTED GMRF
The prior/posterior distribution of the GMRF depends on the hyperparameters, $\mu$ and $\boldsymbol{\theta}$. These values are typically estimated after simulating an initial design. Song and Dong (2018) consider generalized method of moments (GMM) estimation, and Salemi et al. (2019) and Semelhago et al. (2021) adopt the MLEs. However, these algorithms estimate the hyperparameters of the single-layer GMRF, which cannot be done for the scale of the problem considered here.
In this section, we discuss efficient estimation of $\mu$ and $\boldsymbol{\theta}$ from the region- and solution-layer models. Section 3.1 reviews the MLE problem for the single-layer GMRF. In Section 3.2, we discuss how the single-layer MLE problem can be separated into lower-dimensional region- and solution-layer MLE problems by applying our projection scheme.

MLEs for single-layer GMRF
For the single-layer GMRF, Salemi et al. (2019) derive the log-likelihood function, (13), of $(\mu, \boldsymbol{\theta})$ given $\bar{\mathbf{Y}}_2 = \bar{\mathbf{y}}_2$ from the marginal distribution $\bar{\mathbf{Y}}_2 \sim N(\mu\mathbf{1}, \boldsymbol{\Sigma}_{22})$, where $\boldsymbol{\Sigma}_{22} \triangleq (\mathbf{Q}_{22} - \mathbf{Q}_{21}\mathbf{Q}_{11}^{-1}\mathbf{Q}_{12})^{-1} + \mathbf{Q}_\epsilon^{-1}$ is the marginal covariance matrix of $\bar{\mathbf{Y}}_2$. From the optimality condition of (13), the MLE of $\mu$ can be written as a function of $\boldsymbol{\theta}$:

$$\hat{\mu}(\boldsymbol{\theta}) = \frac{\mathbf{1}^\top\boldsymbol{\Sigma}_{22}^{-1}\bar{\mathbf{y}}_2}{\mathbf{1}^\top\boldsymbol{\Sigma}_{22}^{-1}\mathbf{1}}. \qquad (14)$$

We define the profile likelihood, $L(\boldsymbol{\theta} \mid \bar{\mathbf{y}}_2)$, by plugging (14) into (13). Recall from Section 2.1 that the feasibility constraints for $\boldsymbol{\theta}$ are (i) $\theta_0 > 0$, (ii) $\theta_\ell \geq 0$ for $\ell = 1, 2, \dots, d$, and (iii) $\sum_{\ell=1}^{d}\theta_\ell < 0.5$. Thus, the MLE of $\boldsymbol{\theta}$ can be obtained by solving

$$\max_{\boldsymbol{\theta}}\ L(\boldsymbol{\theta} \mid \bar{\mathbf{y}}_2) \quad \text{subject to (i)--(iii)}, \qquad (15)$$

where a strict inequality < (>) can be replaced with ≤ (≥) by subtracting (adding) a small constant to the right-hand side.
For high-dimensional problems, estimating $\boldsymbol{\theta}$ by solving (15) is challenging because (13) and (14) are expensive to compute. Both require $\boldsymbol{\Sigma}_{22}$ to be computed first, for which the bottleneck is factorizing $\mathbf{Q}_{11}$. Since only a small fraction of $\mathcal{X}$ is typically simulated, factorizing $\mathbf{Q}_{11}$ is just as expensive as factorizing $\mathbf{Q}$, which is prohibitively costly for the scale of $\mathcal{X}$ considered here.
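For a small instance, the profile-likelihood maximization (15) can be set up directly with an off-the-shelf solver. The sketch below is ours; `Sigma22_fn` is a hypothetical callback returning the marginal covariance of $\bar{\mathbf{Y}}_2$ for a given $\boldsymbol{\theta}$:

```python
# Sketch: profile log-likelihood with mu profiled out via (14), maximized
# subject to constraints (i)-(iii) of (15) (strict inequalities relaxed by eps).
import numpy as np
from scipy.optimize import minimize

def neg_profile_loglik(theta, ybar2, Sigma22_fn):
    S = Sigma22_fn(theta)                     # marginal covariance of Ybar_2
    Sinv = np.linalg.inv(S)
    one = np.ones(len(ybar2))
    mu_hat = one @ Sinv @ ybar2 / (one @ Sinv @ one)       # (14)
    resid = ybar2 - mu_hat
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (logdet + resid @ Sinv @ resid)

def fit_theta(theta_init, ybar2, Sigma22_fn, eps=1e-6):
    d = len(theta_init) - 1
    bounds = [(eps, None)] + [(0.0, None)] * d             # (i), (ii)
    cons = [{"type": "ineq",
             "fun": lambda t: 0.5 - eps - np.sum(t[1:])}]  # (iii)
    return minimize(neg_profile_loglik, theta_init, args=(ybar2, Sigma22_fn),
                    bounds=bounds, constraints=cons)
```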

Hierarchical MLEs for region- and solution-layer GMRFs
In this section, we reformulate (15) into lower-dimensional region- and solution-layer MLE problems by exploiting Theorem 2.1 and Proposition 2.3. We refer to this estimation procedure as a hierarchical estimation scheme as the region- and solution-layer MLE problems are solved in sequence. Below, we first introduce the region-layer MLE problem.
Let $\bar{\boldsymbol{\theta}} = (\bar{\theta}_0, \bar{\theta}_1, \dots, \bar{\theta}_{d_2})$ be the hyperparameter vector of the approximate region-layer precision matrix, $\mathbf{T}$, which follows the same parameterization as (2). Since each element of $\mathbf{T} = \mathbf{P}\mathbf{Q}\mathbf{P}^\top$ is the sum of the elements of a block matrix of $\mathbf{Q}$, we can obtain the functional relationship between $\bar{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}$. That is,

$$\bar{\theta}_0 = \bar{n}\,\theta_0\left(1 - 2\sum_{j=1}^{d_1}\frac{n_j - 1}{n_j}\,\theta_j\right), \qquad (16)$$

and

$$\bar{\theta}_j = \theta_{d_1+j}\Big/\left(1 - 2\sum_{i=1}^{d_1}\frac{n_i - 1}{n_i}\,\theta_i\right), \quad 1 \leq j \leq d_2. \qquad (17)$$

Similar to the single-layer GMRF MLE problem, the region-layer MLEs can be found by maximizing the region-layer log-likelihood, (18), of $(\mu, \bar{\boldsymbol{\theta}})$ given $\bar{\mathbf{Z}}_2 = \bar{\mathbf{z}}_2$. Thus, the MLE of $\mu$ can be computed at the region layer. As in (14), the $\mu$ that maximizes (18) can be written as a function of $\bar{\boldsymbol{\theta}}$, (19). Consequently, the profile likelihood $L(\bar{\boldsymbol{\theta}} \mid \bar{\mathbf{Z}}_2 = \bar{\mathbf{z}}_2)$ is defined by plugging (19) into (18). The MLE of $\bar{\boldsymbol{\theta}}$ can be found by solving

$$\max_{\bar{\boldsymbol{\theta}}}\ L(\bar{\boldsymbol{\theta}} \mid \bar{\mathbf{Z}}_2 = \bar{\mathbf{z}}_2) \qquad (20)$$

subject to the feasibility constraints on $\bar{\boldsymbol{\theta}}$. Once the solution to (20) is found, $\theta_{d_1+1}, \theta_{d_1+2}, \dots, \theta_d$ are identified up to a scaling factor from (17). Let $\boldsymbol{\eta} \triangleq (\theta_1, \theta_2, \dots, \theta_{d_1})$ be the vector of the remaining elements of $\boldsymbol{\theta}$. From (16), $\theta_0$ can be written as a function of $\boldsymbol{\eta}$ and $\bar{\theta}_0$.
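In code, the mapping from $\boldsymbol{\theta}$ to $\bar{\boldsymbol{\theta}}$ implied by (16) and (17) is a few lines (a sketch assuming the relations exactly as displayed above):

```python
# Sketch: region-layer hyperparameters from single-layer ones via (16)-(17).
import numpy as np

def region_layer_params(theta0, theta, n_per_dim, d1):
    """theta = (theta_1, ..., theta_d); dims 1..d1 are solution-layer dims."""
    n_sol = np.asarray(n_per_dim[:d1], dtype=float)
    nbar = n_sol.prod()
    shrink = 1.0 - 2.0 * np.sum((n_sol - 1.0) / n_sol * np.asarray(theta[:d1]))
    theta0_bar = nbar * theta0 * shrink           # (16)
    theta_bar = np.asarray(theta[d1:]) / shrink   # (17)
    return theta0_bar, theta_bar
```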
We exploit the conditional distribution in Proposition 2.3 to set up a computationally efficient solution-layer MLE problem for $\boldsymbol{\eta}$. The solution-layer log-likelihood function at $\mathbf{R}^*$, computed from (10), depends on $\boldsymbol{\eta}$ through $\mathbf{Q}^*$. Recall that $\theta_0$ is a function of $\boldsymbol{\eta}$ given $\bar{\theta}_0$, and $\mu$ is already computed in the region layer from (19). Therefore, the solution-layer MLE problem can be constructed as the maximization of this log-likelihood over $\boldsymbol{\eta}$ subject to the feasibility constraints, (21). Reparameterizing with $\tilde{\theta}_0 \triangleq 1/\theta_0$, all constraints in (21) become linear. The equality constraint in (21) can be removed by substituting $\theta_0$ as a linear function of $\boldsymbol{\eta}$. Since our goal is to find the global optimum, we choose $\mathbf{R}^* = \mathbf{R}^{\min}$ to compute $\hat{\boldsymbol{\eta}}$, where $\mathbf{R}^{\min}$ is the node to which the current best solution is projected, so that the resulting solution-layer GMRF best fits $\mathbf{R}^{\min}$ and makes the solution-layer search in $\mathbf{R}^{\min}$ more effective.
The hierarchical scheme has significant computational advantages over the single-layer MLE problem. For (20) and (21), the most expensive operations are factorizing $\mathbf{T}_{11}$ and $\mathbf{Q}^*_{11}$, respectively, which are far smaller than $\mathbf{Q}_{11}$. We can also ensure that the estimate of $\boldsymbol{\theta}$ obtained from the hierarchical scheme satisfies the feasibility constraint of the single-layer MLE problem. Clearly, the estimate is nonnegative. Thus, it remains to show that the diagonal dominance condition holds. Observe that

$$\sum_{\ell=1}^{d}\theta_\ell = \sum_{j=1}^{d_1}\eta_j + \frac{\bar{\theta}_0}{\bar{n}\theta_0}\sum_{j=1}^{d_2}\bar{\theta}_j < 0.5,$$

where the equality follows from (16) and (17), and the inequality holds because $0 < \sum_{j=1}^{d_2}\bar{\theta}_j < 0.5$ and $0 < \bar{\theta}_0/(\bar{n}\theta_0) \leq 1$ from the first constraint of (21).
Although adopting the same $\boldsymbol{\theta}$ for both region and solution layers is statistically consistent with our modeling assumptions, in practice this tends to make the solution-layer GMRFs provide poor inference when $\mathcal{X}$ is large. In particular, when there are only a few simulated solutions in a region-layer node and the sample means of the outputs at these solutions are smaller than $\mu$, the posterior means of the solution-layer GMRF at unsimulated solutions tend to be large, i.e., reversion to the mean. This causes the algorithm to exploit the simulated solutions instead of exploring new solutions for several iterations. To avoid this, we make heuristic adjustments in our experiments in Section 5; when $\mathbf{R} \in \mathcal{R}_2$ is selected for sampling, we recompute the MLE $\hat{\mu}(\mathbf{R})$ from (14) using only the simulation observations made within $\mathbf{R}$.

PROJECTED GAUSSIAN MARKOV IMPROVEMENT ALGORITHM
In this section, we propose our DOvS algorithms based on the projected GMRF model. The pGMIA is introduced in Section 4.1, then extended to the pGMIA+ in Section 4.2 to address even higher-dimensional problems. A complete table of notation is provided in Appendix E for convenience. Both pGMIA and pGMIA+ update $\mathcal{P}$ periodically.

Algorithmic details of pGMIA
Algorithm 1 outlines the pGMIA. The pGMIA first randomly selects the initial $\mathcal{P}$ and then solves the region- and solution-layer MLE problems hierarchically. At each iteration, the pGMIA finds new solutions to sample by performing the hierarchical search in Algorithm 3. When a projection update criterion is satisfied, it updates $\mathcal{P}$ and the hyperparameters via Algorithm 4. Algorithm 3 (hierarchical search) takes $\mu$, $\boldsymbol{\theta}$, $\mathcal{P}$, and the current simulation statistics, $\bar{y}(\mathcal{X}_2)$, $\hat{\sigma}^2_\epsilon(\mathcal{X}_2)$, $\bar{Z}(\mathcal{R}_2)$, and $\widehat{\mathrm{Var}}(\mathcal{R}_2)$, as inputs; for each sampled solution, it runs replications and computes $\bar{y}(\mathbf{x})$ and $\hat{\sigma}^2_\epsilon(\mathbf{x})$ at each new solution, and it computes $M(\mathbf{x})$, $V(\mathbf{x})$, and $C(\tilde{\mathbf{x}}(\mathbf{R}), \mathbf{x})$ from (10) for all $\mathbf{x} \in \mathcal{X}(\mathbf{R})$.
For each of the selected region-layer nodes, the algorithm updates the solution-layer posterior distribution and finds the current best solution and the solution with the largest CEI to simulate within the region. These operations can be parallelized as no information is exchanged among the nodes. Any of the selected region-layer nodes may be under-sampled, in which case the inference at the solution-layer GMRF may be poor. To prevent this, Steps 6-9 first ensure that we sample at least $g$ design solutions within each region. If $|\mathcal{X}_2(\mathbf{R})| > 0$, i.e., $\mathbf{R}$ has been sampled before, then we apply the approach proposed in Wang (2003) to obtain $g - |\mathcal{X}_2(\mathbf{R})|$ new solutions, as sketched below. Given $g$, their algorithm first defines the Latin hypercube grid in the domain and shrinks the grid by leaving out the rows and columns that already contain existing design solutions. Then, it generates a Latin hypercube sample of size $g - |\mathcal{X}_2(\mathbf{R})|$ on the shrunk grid and maps it back to the original domain. We modify their algorithm to sample design solutions on the integer grid; further details can be found in Appendix F.
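The following is a simplified stand-in (ours) for that design-augmentation step: it draws $g$ integer design solutions by stratified sampling per dimension, whereas the actual pGMIA uses the shrunk-grid construction of Wang (2003) to also avoid the rows and columns occupied by existing design solutions; $g$ here is our label for the design-size parameter.

```python
# Sketch: an approximate integer Latin-hypercube design of size g on a grid.
import numpy as np

def integer_lhs(g, n_per_dim, rng=None):
    rng = rng or np.random.default_rng()
    cols = []
    for m in n_per_dim:
        # One draw per (approximate) stratum of {0, ..., m-1}.
        edges = np.floor(np.linspace(0, m, g + 1)).astype(int)
        lo = edges[:-1]
        hi = np.minimum(np.maximum(edges[1:], lo + 1), m)
        draws = rng.integers(lo, hi)
        cols.append(rng.permutation(draws))   # random pairing across dimensions
    return np.stack(cols, axis=1)             # g x d_1 integer design matrix

design = integer_lhs(g=10, n_per_dim=[5] * 5)
```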
In Steps 2 and 11 of Algorithm 3, in lieu of computing the full posteriors, we only compute the elements of the posterior mean vector and covariance matrix necessary for computing the CEIs in the region and solution layers, respectively.
When $\mathcal{P}$ is updated, the region-layer hyperparameters, $\bar{\boldsymbol{\theta}}$, must be updated according to the new $\mathcal{P}$. The new $\bar{\boldsymbol{\theta}}$ can be computed from the old $\mu$ and $\boldsymbol{\theta}$ estimated under the previous projection, say $\mathcal{P}'$, by first computing $\theta_{d_1+1}, \theta_{d_1+2}, \dots, \theta_d$ from (17), then solving $\mathbf{T} = \mathbf{P}\mathbf{Q}\mathbf{P}^\top$ for each element of $\bar{\boldsymbol{\theta}}$. However, we empirically observed that re-estimating the hyperparameters instead allows the pGMIA to make faster search progress, particularly in earlier iterations.
Algorithm 4 details the projection and hyperparameter update in the pGMIA; it takes $\mu$, $\boldsymbol{\theta}$, $\mathcal{P}$, $\bar{y}(\mathcal{X}_2)$, and $\hat{\sigma}^2_\epsilon(\mathcal{X}_2)$ as inputs, runs replications at newly added solutions where needed, recomputes $\bar{Z}(\mathbf{R})$ and $\widehat{\mathrm{Var}}(\bar{Z}(\mathbf{R}))$ from (6) and (8), and returns the updated $\mathcal{P}$, hyperparameters, and simulation statistics. Updating $\mathcal{P}$ tends to increase $|\mathcal{R}_2|$ because the solutions projected to the same region-layer node under the previous projection, $\mathcal{P}'$, have the same coordinates in the dimensions included in the region layer under $\mathcal{P}'$. Thus, when the region-layer dimensions change under $\mathcal{P}$, those solutions are allocated to different region-layer nodes, possibly generating more sampled nodes. In earlier iterations, this may result in several nodes with only one sampled solution. For such nodes $\mathbf{R}$, $\widehat{\mathrm{Var}}(\bar{Z}(\mathbf{R}))$ cannot be computed, and thus we do not include those nodes in the set of sampled nodes, $\mathcal{R}_2$, in the region-layer MLE calculation. Step 1 of Algorithm 4 checks whether there are enough nodes under $\mathcal{P}$ that contain two or more simulated solutions. If not, we randomly select region-layer nodes among those with only one simulated solution and sample an extra solution in each so that the variance estimate in (8) can be computed.
We establish global convergence of pGMIA under Assumption 1 stated below.
Assumption 1. For all $\mathbf{x} \in \mathcal{X}$, $y(\mathbf{x}) > -\infty$ and $0 < \mathrm{Var}[Y(\mathbf{x})] < +\infty$. Moreover, the MLEs of the GMRF hyperparameters are updated only finitely many times for each $\mathcal{P}$ throughout the run.
Notice that $Y(\mathbf{x})$ need not be normally distributed as long as its variance is finite. The last condition ensures that the hyperparameters are fixed after a finite number of iterations. Theorem 4.1 below guarantees that the pGMIA converges to the global optimum almost surely regardless of the projection selection criterion. See Appendix G for its proof.

Theorem 4.1. Under Assumption 1, the current best solution of the pGMIA converges to a global optimum of the DOvS problem almost surely as the number of iterations grows without bound.

pGMIA+: Multi-layer Extension of pGMIA
As the problem dimension increases, the two-layer model adopted in the pGMIA eventually runs into a computational limitation as the number of solutions/nodes in each layer becomes too large. In this section, we propose a multi-layer extension of the pGMIA, the pGMIA+, which partitions the dimensions of the solution space into $m > 2$ batches to achieve computational efficiency for even higher-dimensional problems.
Let the number of dimensions included in the $i$th group of the partition be $d_i$ so that $\sum_{i=1}^{m} d_i = d$. Without loss of generality, let us label the dimensions in the $i$th group as $\sum_{i'=1}^{i-1} d_{i'} + 1, \dots, \sum_{i'=1}^{i} d_{i'}$. Each layer of the multi-layer GMRF model is constructed by hierarchically projecting the dimensions of the solution space onto a lower-dimensional integer lattice. The $m$th-layer graph is the top-layer graph, where all dimensions are projected onto the last $d_m$ dimensions. Each $m$th-layer node has unique $d_m$-dimensional coordinates. For $2 \leq i \leq m$, each $i$th-layer node can be mapped to the $(i-1)$th-layer graph consisting of the $(i-1)$th-layer nodes projected to that $i$th-layer node.
We illustrate the $m$-layer projection scheme in Figure 2 using a three-dimensional solution space consisting of 27 solutions, where the first dimension takes values from 1 to 3, the second from 4 to 6, and the third from 7 to 9. Suppose the dimensions are partitioned into three batches: {1}, {2}, and {3}. Therefore, the top (third) layer consists of the third dimension only, resulting in a one-dimensional graph with three nodes. Notice that the third-layer node with coordinate 9 is mapped to the second-layer graph with three nodes, (4, 9), (5, 9), and (6, 9), whose third coordinates are 9. Similarly, other nodes in the third layer also have corresponding second-layer graphs; thus, three second-layer graphs are defined in total. The figure also shows that the second-layer node (6, 9) is mapped to the first-layer graph consisting of (1, 6, 9), (2, 6, 9), and (3, 6, 9). There are nine first-layer graphs in total.
The pGMIA+ defines the response at each $i$th-layer node as the average of the responses at all nodes in the corresponding $(i-1)$th-layer graph. For $1 \leq i \leq m$, an $i$th-layer GMRF is defined for each $i$th-layer graph. As in the pGMIA, the pGMIA+ selects a node to sample starting at the $m$th layer and continues its way down to the first layer by selecting a node in the lower-layer graph. For instance, in Figure 2, suppose 9 is selected for sampling in the third layer. Then, the pGMIA+ continues to make a sampling decision on the second layer by selecting a node among (4, 9), (5, 9), and (6, 9). This procedure is repeated in the first layer, until a solution among (1, 6, 9), (2, 6, 9), and (3, 6, 9) is selected. Hence, while there are 27 feasible solutions in the single-layer GMRF, the pGMIA+ only needs to make inference at three nodes on each layer, which is why the computational cost is significantly reduced in the pGMIA+.
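The bookkeeping behind this hierarchy is simple: after the relabeling described above, the layer-$i$ node of a solution is obtained by dropping the coordinates of groups $1, \dots, i-1$. A small sketch (ours) reproducing the Figure 2 example:

```python
# Sketch: map a solution to its node on each layer of the pGMIA+ hierarchy.
def layer_node(x, dims_per_group, i):
    """x: d coordinates ordered by group 1,...,m; returns x's layer-i node."""
    start = sum(dims_per_group[:i - 1])   # drop the first i-1 dimension groups
    return tuple(x[start:])

x = (2, 6, 9)                             # Figure 2, groups {1}, {2}, {3}
print(layer_node(x, [1, 1, 1], 3))        # (9,)       third-layer node
print(layer_node(x, [1, 1, 1], 2))        # (6, 9)     second-layer node
print(layer_node(x, [1, 1, 1], 1))        # (2, 6, 9)  the solution itself
```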
Similar to the two-layer case, the MLEs of the GMRF hyperparameters can be computed hierarchically from the $m$th layer down to the first layer, exploiting Theorem 2.1 and Proposition 2.3. The details of the multi-layer GMRF distributions, their MLE problems, and the pGMIA+ are presented in Appendix H. Corollary 4.2 below states the global convergence of the pGMIA+; its proof can be found in Appendix I.

Corollary 4.2. Under Assumption 1, the current best solution of the pGMIA+ converges to a global optimum of the DOvS problem almost surely as the number of iterations grows without bound.

In practice, making exact inference on all layers may be too expensive when $m$ is large. One way to reduce the computational overhead is to make exact inferences on the top $m - 1$ layers while randomly sampling a solution to simulate in the first layer. The computational saving can be made substantial by setting $d_1$ large.

EMPIRICAL ANALYSIS
In this section, we provide extensive empirical results to demonstrate the performance of the pGMIA and pGMIA+. Both algorithms are compared with the MR-GMIA in Salemi et al. (2019) and four state-of-the-art high-dimensional BO algorithms: Random EMbedding Bayesian Optimization (REMBO) in Wang et al. (2016), Sparse Axis-Aligned Subspace Bayesian Optimization (SAASBO) in Eriksson and Jankowiak (2021), and High-dimensional Bayesian Optimization (HDBO) and High-dimensional Batch Bayesian Optimization (HDBBO) in Wang et al. (2017). REMBO and SAASBO are projection-based BO algorithms; while SAASBO only considers an axis-aligned projection, both adopt the Expected Improvement (EI) as the sampling criterion. HDBO and HDBBO are batching approaches. Both apply the Upper Confidence Bound (UCB) as the sampling criterion; however, HDBBO has a parameter $b$ that controls the number of solutions sampled in each iteration: it selects the $b$ solutions with the largest UCB values. Although their inference is still valid, REMBO, SAASBO, HDBO, and HDBBO are not designed specifically for the integer solution space; when a fractional solution is chosen by these algorithms, we round it to the nearest integer solution for simulation.

Performance of pGMIA
We test two choices of projection selection criteria with the pGMIA, random selection and distance correlation. To differentiate the two criteria, we denote them as pGMIA-R and pGMIA-DC, respectively. We demonstrate the performance of pGMIA-R and pGMIA-DC using three popular optimization test functions (https://www.sfu.ca/~ssurjano/optimization.html): the Zakharov (Section 5.1.1), Branin (Section 5.1.2), and Styblinski-Tang (Section 5.1.3) functions. To convert them to DOvS problems, we add stochastic noise to the function values. All dimensions of the Zakharov function are active and it is non-decomposable. The Branin function has a two-dimensional active subspace and thus is favorable to projection-based methods like REMBO and SAASBO. The Styblinski-Tang function is decomposable in each coordinate direction and thus is favorable to batching methods such as HDBO and HDBBO. We also include a slightly modified Styblinski-Tang function to induce interactions between the dimensions.
All three test problems are 10-dimensional and there are 5 feasible values in each dimension. We set $d_1 = d_2 = 5$ for pGMIA-R and pGMIA-DC. We choose the last 5 dimensions to be included in the region layer for the MR-GMIA and set the dimension of the active subspace of REMBO to be 5. Since Wang et al. (2017) do not give specific guidance on the batch size, we choose $b = 6$ for HDBBO as pGMIA-R and pGMIA-DC typically sample 6 solutions at each iteration.
To estimate the hyperparameters, all algorithms initially sample 100 design solutions via space filling for the Zakharov and Styblinski-Tang functions. In pGMIA-R, pGMIA-DC, and the MR-GMIA, 10 region-layer nodes and 10 solutions per node are selected via space filling. For the Branin function, all algorithms initially sample 9 design solutions (3 region-layer nodes and 3 solutions per node for pGMIA-R, pGMIA-DC, and the MR-GMIA). Note that the Branin function has two active dimensions. Therefore, we choose a small initial design to avoid finding the optimal solution right after sampling the initial design. All algorithms run 10 replications each time they sample a solution.
In pGMIA-R and pGMIA-DC, we compute the MLEs from the outputs of the 50 most-sampled region-layer nodes to avoid sampling too many new solutions when the projection scheme is updated, and to save computational cost.

Zakharov Function
The Zakharov function is neither additive nor has a lower-dimensional active subspace; all ten dimensions are active and there is an interaction term between any two dimensions. These features make the Zakharov function not particularly advantageous for any of the four benchmarking algorithms. We choose $\mathcal{X} = \{-2, -1, 0, 1, 2\}^{10}$ as the feasible solution space, which makes $y(\mathbf{x}) \in [0, 9.15 \times 10^6]$. The global optimum of the Zakharov function is at $\mathbf{x} = \mathbf{0}_{10}$ with response 0, where the difference between the responses at the global minimum and the second best is 1.3. To make the output stochastic, we add normal noise that follows $N(0, 1.8^2)$ to the response function.
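For reference, the noisy oracle for this experiment can be written as below (our rendering of the stated setup; the Zakharov formula is the standard one from the cited test-function library):

```python
# Sketch: noisy Zakharov oracle on X = {-2,...,2}^10; minimum 0 at x = 0.
import numpy as np

def zakharov(x):
    x = np.asarray(x, dtype=float)
    s = 0.5 * np.sum(np.arange(1, len(x) + 1) * x)
    return np.sum(x**2) + s**2 + s**4

def simulate(x, sd=1.8, rng=None):
    rng = rng or np.random.default_rng()
    return zakharov(x) + rng.normal(0.0, sd)   # N(0, 1.8^2) noise

# The max over X is attained at x = 2*ones(10): about 9.15e6, matching the range above.
```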
We first examine the sensitivity of the pGMIA to the choice of the projection update period before comparing it with the benchmarking algorithms. Figure 3 displays the trajectories of the optimality gap against the number of samples pGMIA-R evaluates, averaged over 100 macro-runs; recall that each sample consists of 10 simulation replications. The shaded area around each curve shows point-wise ±2 standard errors of the average performance computed from 100 macro-runs. Although update periods of 5 and 100 show slightly less desirable behaviors in the beginning and toward the end, respectively, the choice of the update period does not appear to affect the pGMIA's performance significantly. As more solutions are sampled, for all five choices of the update period, pGMIA-R eventually converges to the global optimum. The same sensitivity analysis was conducted for pGMIA-DC and showed similar conclusions.
In the remainder of Section 5, we adopt an update period of 20 for pGMIA-R and pGMIA-DC. To match the pGMIA, we also set REMBO to update its parameters every 20 iterations. Additionally, we let SAASBO, HDBO, and HDBBO update the projection/decomposition and parameters every 20 iterations. The MR-GMIA only computes the parameters once at the beginning of the algorithm.
Figure 4(a) compares the performance of pGMIA-R and pGMIA-DC with all other algorithms. Notably, pGMIA-R converges faster than pGMIA-DC, indicating that randomly selecting the projection is effective. The performance difference between pGMIA-DC and pGMIA-R can be attributed to the frequency at which each algorithm updates the projection. For the experiments reported in Figure 4(a), pGMIA-R and pGMIA-DC evaluate the projection update criterion 19 times by the 1000th sample. While pGMIA-R updates the projection 18.8 times on average across 100 macro-runs (standard error 0.037), pGMIA-DC updates only 14.2 times (0.397), indicating that the same projection may be chosen multiple times by the distance correlation criterion. Early on in the algorithm, there is a benefit to changing the projection more frequently as it induces more exploration in all dimensions. Moreover, pGMIA-DC estimates the distance correlations based on the posterior means obtained from the GMRF model as discussed in Section 4.1, which tend to have large prediction errors in the earlier iterations.
As expected, REMBO, SAASBO, HDBO, and HDBBO perform poorly on the Zakharov function, which is not additive and all of whose dimensions are active. We stopped running SAASBO around 500 samples because it takes more than 2 hours per iteration at that point, and the trajectory already demonstrates the efficiency of the pGMIA over SAASBO. Additionally, we observe that the MR-GMIA, which uses a fixed projection scheme, progresses slowly after 200 samples. This shows that updating the projection scheme is indeed effective in narrowing the optimality gap.

Figure 4: Optimality gap trajectories under noise levels (a) $N(0, 1.8^2)$, (b) $N(0, 2.6^2)$, (c) $N(0, 3.9^2)$, and (d) $N(0, 5.2^2)$.
Next, we examine the robustness of the pGMIA and the benchmarking algorithms to different levels of stochastic noise. In Figures 4(b)-4(d), we increase the stochastic noise variance from $1.8^2$ to $2.6^2$, $3.9^2$, and $5.2^2$, respectively. Observe that the trajectories of pGMIA-R, pGMIA-DC, and the MR-GMIA are similar across all four cases, suggesting that these algorithms perform consistently under different noise conditions. On the contrary, REMBO, SAASBO, HDBO, and HDBBO show noticeable performance degradation with increasing noise levels. In particular, HDBBO starts outperforming the MR-GMIA around the 300-sample mark when the stochastic noise variance is $1.8^2$, but the cross-over point increases to 900 samples when the stochastic noise variance is $5.2^2$.
To examine the computational overhead of the algorithms, Figure 5 plots the average trajectories of the optimality gap against the wall-clock time computed from 30 macro-runs, where the stochastic noise variance is $1.8^2$. All algorithms are run on a single thread of an Intel Core i5-9600K CPU (3.70GHz) with 16.0 GB RAM to measure the wall-clock time. Observe that pGMIA-R shows the fastest progress and clearly dominates the others. The performance difference between pGMIA-R and pGMIA-DC can be attributed to the cost of computing the distance correlations in pGMIA-DC. All three variations of the GMIA outperform the others in wall-clock time, reflecting the computational benefit of GMRF-based algorithms over the GP-based algorithms applied to the integer solution space. SAASBO shows no improvement over the 1000-second period due to its heavy computational overhead.
To summarize, the experiment results in this section demonstrate the robustness of the pGMIA to different levels of stochastic noise and highlight its superior sample and computational efficiency on a non-additive objective function for which all dimensions are active, compared to other batching or projection-based BO algorithms.
Branin Function

Figure 6 shows the trajectories of the optimality gap averaged over 100 macro-runs when the algorithms are terminated after 500 samples. Although both pGMIA-R and pGMIA-DC are slower in the beginning, they outperform all other algorithms after 120 samples. REMBO makes reasonably good progress at the beginning since the Branin function has a two-dimensional active subspace; however, it significantly slows down after about 100 samples. On the other hand, SAASBO converges to the global optimum after 400 samples. For the MR-GMIA, the first five dimensions in the region layer include the two active dimensions of the Branin function. Therefore, all solutions within a region in the MR-GMIA have identical responses, which makes the solution-layer search meaningless. HDBO shows good initial performance but slows down after 100 samples, exhibiting the worst performance among all. On the other hand, HDBBO is almost as competitive as pGMIA-R and pGMIA-DC. We conjecture that the batch sampling scheme of HDBBO increases its chance to explore more along the active dimensions.
We close this section by pointing out that even if the true objective function has a lower-dimensional active subspace, the pGMIA outperforms the benchmarked active subspace-based BO algorithms.
Styblinski-Tang Function

In the following, we compare the performance of pGMIA-R and pGMIA-DC tested on the Styblinski-Tang function against MR-GMIA, REMBO, SAASBO, HDBO, and HDBBO. Figure 7(a) shows the trajectories of the optimality gap against the number of samples averaged over 100 macro-runs. Notice that HDBO exhibits the best performance at the beginning of the algorithm. The pGMIA-R and pGMIA-DC outperform HDBO after the 300-sample mark; this is when the projection and MLEs are updated for the first time. Until the first projection update, pGMIA-R and pGMIA-DC are identical, but thereafter, pGMIA-DC shows slower progress than pGMIA-R. HDBBO is not competitive at the beginning of the algorithm, but catches up with pGMIA-R and pGMIA-DC in the long run. REMBO's and SAASBO's poor performances are not surprising as all dimensions are active in the Styblinski-Tang function. The MR-GMIA performs better than REMBO but not as well as the other four algorithms. In addition, both MR-GMIA and REMBO do not update the initial projection scheme, which explains their slower convergence.
To examine the effect of interactions across dimensions on the algorithms' performances, we modify the Styblinski-Tang function $f$ to $f(\mathbf{A}\mathbf{x})$, where $\mathbf{A} = [a_{i,j}]$ is a $10 \times 10$ matrix whose elements are zeros except that $a_{i,i} = a_{i,i+1} = 0.5$ for $1 \leq i \leq 9$ and $a_{10,1} = a_{10,10} = 0.5$. Given the same $\mathcal{X}$, the global optimum is at $\mathbf{x} = (-3)\mathbf{1}_{10}$ and its response remains the same. Figure 7(b) shows the results for the modified Styblinski-Tang function. All algorithms except for HDBBO show a smaller optimality gap at a given number of samples compared to the original function, while pGMIA-R and pGMIA-DC still exhibit the best performances among all. The relatively poor performance of HDBBO can be attributed to the fact that the function is no longer additive; its sample performance is affected more than HDBO's because it selects a batch of solutions assuming the additive structure. Not surprisingly, the two active-subspace-based algorithms, REMBO and SAASBO, still show poor performances.
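To make the interaction structure concrete, below is a minimal NumPy sketch (ours, not the authors' code) of the modified test function; the function and variable names are illustrative, and in the experiments i.i.d. Gaussian noise is added on top of each evaluation.

```python
import numpy as np

def styblinski_tang(x):
    """Standard Styblinski-Tang function; on the integer lattice its minimum is at x_i = -3."""
    x = np.asarray(x, dtype=float)
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x)

def make_mixing_matrix(d=10):
    """A = [a_ij] with a_{i,i} = a_{i,i+1} = 0.5 cyclically, so row i averages x_i and x_{i+1}."""
    A = np.zeros((d, d))
    for i in range(d):
        A[i, i] = 0.5
        A[i, (i + 1) % d] = 0.5  # the last row wraps around: a_{10,1} = a_{10,10} = 0.5
    return A

A = make_mixing_matrix(10)

def modified_styblinski_tang(x):
    """f(Ax): every coordinate now interacts with its cyclic neighbor."""
    return styblinski_tang(A @ np.asarray(x, dtype=float))

# The optimum and its response are preserved: both evaluate to -390 at x = (-3, ..., -3).
x_star = -3.0 * np.ones(10)
assert np.isclose(styblinski_tang(x_star), modified_styblinski_tang(x_star))
```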
In both Figures 7(a) and 7(b), there are no significant differences between the performances of pGMIA-DC and pGMIA-R. This can be attributed to the fact that all dimensions of the Styblinski-Tang function are equally important. As a result, the marginal distance correlations of all dimensions are identical.

Performance of pGMIA+
In this section, we test the 100-dimensional Branin (Section 5.2.1) and Zakharov (Section 5.2.2) functions and a 30-dimensional Assemble-to-Order DOvS problem (Section 5.2.3) to demonstrate the performance of pGMIA+. Moreover, we only test pGMIA+ with random selection as the projection selection criterion because (i) it shows better performance when combined with pGMIA in Section 5.1, and (ii) it avoids computing the distance correlations for all dimensions. The MR-GMIA is dropped here as the problem size is too large for it to be computationally feasible.
In Sections 5.2.1–5.2.2, we adopt a three-layer GMRF model for the pGMIA+, where the layer dimensions are 3, 3, and 94 for layers 3, 2, and 1, respectively. With this setting, the pGMIA+ samples 8–19 solutions at each iteration, depending on whether the CEI-maximizing or sample-best node at each layer coincides with the node that the sample-best solution is projected to. See Appendix H for the details. We set the active subspace dimension of REMBO to 6, matching the combined dimension of the two upper layers (3 + 3 = 6), and adopt a batch size of 18 for HDBBO, as pGMIA+ typically samples 18 solutions at each iteration. All algorithms are initialized with the same number of design solutions selected via Latin hypercube sampling and simulate 10 replications per sampling decision. As in Section 5.1, we select (at most) the 50 most-sampled nodes for MLE estimation.
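As a rough illustration of the random projection selection used here, the following sketch (our own, with hypothetical names) randomly partitions the 100 dimension indices into three layer batches of sizes 3, 3, and 94; pGMIA+ would redraw such a partition whenever the projection is updated.

```python
import numpy as np

def random_layer_batches(d=100, layer_dims=(3, 3, 94), rng=None):
    """Randomly assign the d dimension indices to layer batches (layers 3, 2, 1)."""
    assert sum(layer_dims) == d
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(d)  # a uniformly random ordering of the dimensions
    batches, start = [], 0
    for size in layer_dims:
        batches.append(np.sort(perm[start:start + size]))
        start += size
    return batches

layer3_dims, layer2_dims, layer1_dims = random_layer_batches(rng=np.random.default_rng(0))
```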
In Section 5.2.3, we change the setting for the pGMIA+ so that the layer dimensions are 3, 3, and 24 for layers 3, 2, and 1, respectively. We add $N(0, 1.8^2)$ stochastic noise to the response function, and all algorithms initially sample 200 design solutions (10, 10, and 2 design nodes at layers 3, 2, and 1, respectively, for pGMIA+) for hyperparameter estimation. Figure 8 shows the trajectories of the optimality gap averaged over 100 macro-runs. Notice that pGMIA+ outperforms REMBO and SAASBO. Again, the observation for SAASBO is cut short due to its prohibitively large computational cost. The pGMIA+ converges to the global optimum after 500 samples, while HDBBO converges after 800 samples. Unlike in the 10-dimensional Branin function, REMBO performs significantly worse than pGMIA+, HDBBO, and SAASBO. This is because the random projection REMBO adopts tends to project the solution selected in the 6-dimensional active subspace to outside of the feasible region of the 100-dimensional solution space. A similar observation has been reported by Letham et al. (2020).

Zakharov
Figure 9 shows the trajectories of the true optimality gap averaged over 100 macro-runs when the algorithms are terminated after 1,000 samples. In the first 400 samples, it is difficult to distinguish the four algorithms due to large run-to-run variation. We stopped running SAASBO after 300 samples because each of its iterations takes more than one hour at this point. After 400 samples, we can observe that pGMIA+ shows better performance than the other two algorithms. Although it appears that the pGMIA+'s progress slows down as the algorithm proceeds, we note that the remaining optimality gap (approximately 1,000) is negligible compared to the range of the function, $2.54 \times 10^{16}$.

Assemble-to-Order (ATO) System

The ATO system (Hong and Nelson 2006) is an inventory management example in which the final product is assembled from components made to stock whenever an order is received. The components required for each product vary. We modify the ATO problem to have 5 product groups, indexed $1 \leq i \leq 5$, where each group has 5 product types assembled from 6 components. The objective is to determine the expected profit-maximizing inventory levels of the 6 components in each of the 5 product groups, which is a 30-dimensional DOvS problem. The system applies a continuous-review base-stock policy; that is, whenever there is demand for a component, the system automatically starts producing the component to fill inventory up to the base-stock level. For each component of each product group, the production time follows a normal distribution truncated to $(0, +\infty)$ with a component-specific mean and variance. For each product group, there are only 2 machines producing the components on a first-in-first-out basis. Each component has an inventory capacity and a holding cost per period.
Orders of the $j$th product type of the $i$th product group arrive according to a Poisson process with a product-specific rate, for $i = 1, 2, \ldots, 5$ and $j = 1, 2, \ldots, 5$. When a product order is received, if any of its required components are out of stock, the order is canceled with a fixed penalty of \$3 per order. Otherwise, the order is assembled immediately and generates a product-specific profit. Table 1 displays the parameters of the ATO system for each product group.
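The following sketch (ours; parameter names are placeholders, and the actual values come from the paper's Table 1, not reproduced here) shows how the two stochastic primitives of the ATO simulation described above can be sampled: production times from a normal distribution truncated to $(0, +\infty)$ and order arrivals from a Poisson process.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def production_time(mu, sigma):
    """Production time ~ Normal(mu, sigma^2) truncated to (0, +inf)."""
    a = (0.0 - mu) / sigma  # lower truncation bound in standardized units
    return truncnorm.rvs(a, np.inf, loc=mu, scale=sigma, random_state=rng)

def order_arrival_times(rate, horizon):
    """Arrival times of a Poisson process with the given rate over [0, horizon]."""
    n = rng.poisson(rate * horizon)
    return np.sort(rng.uniform(0.0, horizon, size=n))

CANCELLATION_PENALTY = 3.0  # fixed penalty ($3) charged per canceled order
```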
Unlike the other examples, the ATO problem has heteroscedastic stochastic noise across the solutions. Thus, we drop REMBO from the comparison in this section, as it only allows homoscedastic noise. We also drop SAASBO due to its heavy computational cost. Figure 10 shows the trajectories of the estimated optimality gap at the current best solution against the number of samples averaged over 100 macro-runs. Clearly, the pGMIA+ outperforms the HDBBO, and the gap between them widens as the algorithms continue. This indicates the effectiveness of the pGMIA+ when applied to a more realistic simulation optimization problem.

CONCLUSIONS
The pGMIA reduces the computational complexity of a high-dimensional DOvS problem by batching the dimensions into region and solution layers and projecting the solution space onto the region-layer dimensions. The pGMIA hierarchically optimizes each layer; it first selects a region-layer node to simulate, then selects a solution to simulate within the solution-layer graph projected to that region-layer node. Since the dimensions of both the region and solution layers are lower than that of the original solution space, the computational overhead of the algorithm is greatly reduced. However, the region-layer GMRF's precision matrix becomes dense after the projection, which negates the computational benefit of the GMRF in solving a DOvS problem. We resolve this issue by proposing the approximate sparse precision matrix introduced in Theorem 2.1, which becomes increasingly accurate as the number of feasible values in each dimension increases. Exploiting the approximate precision matrix, we also propose novel approaches to estimate the hyperparameters of the region- and solution-layer GMRFs. The pGMIA can be extended to the pGMIA+ to incorporate more layers and further exploit the computational benefit of projection.
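As a schematic recap of the hierarchical sampling decision, the toy sketch below (ours) replaces the correlated GMRF conditionals and the CEI criterion with independent Gaussian posteriors and a generic expected-improvement-style score; it only illustrates the two-stage region-then-solution selection, not the paper's actual inference.

```python
import numpy as np
from scipy.stats import norm

def improvement_score(mean, var, best):
    """Expected-improvement-style score for minimization; a stand-in for CEI."""
    std = np.sqrt(np.maximum(var, 1e-12))
    z = (best - mean) / std
    return std * (z * norm.cdf(z) + norm.pdf(z))

rng = np.random.default_rng(1)
n_regions, n_solutions = 5, 20
region_mean, region_var = rng.normal(size=n_regions), np.ones(n_regions)
sol_mean = rng.normal(size=(n_regions, n_solutions))
sol_var = np.ones((n_regions, n_solutions))
best_observed = region_mean.min()

# Stage 1: low-resolution region-layer inference picks a promising region.
r = int(np.argmax(improvement_score(region_mean, region_var, best_observed)))
# Stage 2: high-resolution solution-layer inference picks a solution within it.
s = int(np.argmax(improvement_score(sol_mean[r], sol_var[r], best_observed)))
print(f"simulate solution {s} in region {r}")
```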
Across all numerical examples, we have observed that the pGMIA with the random projection selection criterion outperforms the other benchmarks. We attribute its performance to the fact that the pGMIA uses batching and projection simply as means to reduce the computational burden, instead of assuming that the objective function is additive or has a lower-dimensional active subspace; it periodically updates the batches of dimensions by randomization, which allows the interaction effects among dimensions previously placed in different batches to be learned. Moreover, the pGMIA does not regard the projected dimensions as inactive; once a region-layer sampling decision is made based on the projected GMRF model, it applies a sampling criterion to the solution-layer model, which consists of the dimensions projected out at the region layer, to complete the sampling decision.
We have also observed that the pGMIA tends to perform better than the GMIA, even for a lower-dimensional problem whose computational overhead is cheap enough for the GMIA. The GMIA fits a single-layer GMRF to the entire solution space, and it is harder to obtain an accurate global fit when the solution space is large. The projected GMRF, on the other hand, models each solution-layer graph with a solution-layer GMRF that provides better local inference. Meanwhile, the region-layer GMRF provides low-resolution inference to quickly find promising regions of the solution space. Consequently, the pGMIA learns the local performance of the objective function more effectively in regions with good average performance.
Although the pGMIA+ scales well with the dimension when the number of feasible values in each dimension is relatively small, when that number is large, it is limited by the size of the precision matrix that can be factorized on a computer. In the latter case, the pGMIA+ may only contain a few dimensions in each batch. This can slow down the algorithm's progress, as interaction effects among dimensions in separate batches may not be captured effectively until the dimensions are rebatched. Aggregating the feasible values in each dimension to obtain a coarser model may help in this case. Similar ideas have been explored by Mes et al. (2011).
Exploiting parallel computing to improve the efficiency of the pGMIA is another important future research topic. As mentioned in Section 4, there is no exchange of information among the solutions projected to different nodes in pGMIA. Thus, the solution-layer search can be run in parallel if multiple regions are selected for sampling. This will require the sampling criterion to be extended to evaluate the benefit of jointly sampling a set of regions, as well as a stopping criterion to terminate the parallel search when a region-layer node is deemed no longer worth exploring.
Theorem 4.1. Under Assumption 1, the pGMIA run without stopping converges to the global optimum almost surely.
Corollary 4.2. Under Assumption 1, the pGMIA+ run without stopping converges to the global optimum almost surely.
Fig. 8. Comparison of the trajectories of the optimality gap of the 100-dimensional Branin function versus the number of samples for pGMIA+, REMBO, SAASBO, and HDBBO (batch size 18), averaged over 100 macro-runs.

Fig. 9. Comparison of the trajectories of the optimality gap of the 100-dimensional Zakharov function versus the number of samples for pGMIA+, REMBO, SAASBO, and HDBBO (batch size 18), averaged over 100 macro-runs.

Fig. 10. Comparison of the trajectories of the estimated optimality gap of the ATO problem at the current best solution for pGMIA+ and HDBBO (batch size 18).