Batched Neural Bandits

In many sequential decision-making problems, the individuals are split into several batches and the decision-maker is only allowed to change her policy at the end of each batch. These batch problems have a large number of applications, ranging from clinical trials to crowdsourcing. Motivated by this, we study the stochastic contextual bandit problem for general reward distributions under the batched setting. We propose the BatchNeuralUCB algorithm, which combines neural networks with optimism to address the exploration-exploitation tradeoff while keeping the total number of batches limited. We study BatchNeuralUCB under both fixed and adaptive batch size settings and prove that it achieves the same regret as the fully sequential version while reducing the number of policy updates considerably. We confirm our theoretical results via simulations on both synthetic and real-world datasets.


Introduction
In the stochastic contextual bandit problem, a learner sequentially picks actions over T rounds (the horizon). At each round, the learner observes K actions, each associated with a d-dimensional feature vector. After selecting an action, she receives a stochastic reward. Her goal is to maximize the cumulative reward attained over the horizon. Contextual bandit problems have been extensively studied in the literature (Langford and Zhang, 2007; Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2020) and have a vast number of applications, such as personalized news recommendation (Li et al., 2010) and healthcare (see Bouneffouf and Rish (2019) and references therein).
Various reward models have been considered in the literature, such as linear models (Auer, 2002; Dani et al., 2008; Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013), generalized linear models (Filippi et al., 2010; Li et al., 2017), and kernel-based models (Srinivas et al., 2009; Valko et al., 2013). Recently, neural network models that allow for a more powerful approximation of the underlying reward functions have been proposed (Riquelme et al., 2018; Zhou et al., 2019; Zhang et al., 2020; Xu et al., 2020). For example, the NeuralUCB algorithm (Zhou et al., 2019) can achieve near-optimal regret bounds while only requiring a very mild boundedness assumption on the rewards. However, a major shortcoming of NeuralUCB is that it requires updating the neural network parameters in every round, as well as optimizing the loss function over all observed rewards and contexts. This task is computationally expensive, making NeuralUCB considerably slow for large-scale applications and inadequate for practical situations where the data arrives at a fast rate (Chapelle and Li, 2011).
In addition to these computational issues, many real-world applications require limited adaptivity, where the decision-maker may update the policy only at certain time steps. Examples include multi-stage clinical trials (Perchet et al., 2016), crowdsourcing platforms (Kittur et al., 2008), and running time-consuming simulations for reinforcement learning (Le et al., 2019). This restriction formally motivates the batched multi-armed bandit problem, first studied in Perchet et al. (2016) for the two-armed bandit with noncontextual rewards. Their results have recently been extended to the many-armed setting (Gao et al., 2019; Esfandiari et al., 2019; Jin et al., 2020), linear bandits (Esfandiari et al., 2019), and linear contextual bandits (Han et al., 2020; Ruan et al., 2020). A closely related literature on the rarely switching multi-armed bandit problem measures limited adaptivity by the number of policy switches (Cesa-Bianchi et al., 2013; Dekel et al., 2014; Simchi-Levi and Xu, 2019; Ruan et al., 2020): in contrast to batched models, where policy updates occur at pre-fixed time periods, there the policy updates are adaptive and can depend on the previous context and reward observations. While these papers provide a complete characterization of the optimal number of policy switches in both cases for the stochastic contextual bandit problem with linear rewards, the extension of these results to more general rewards remains unstudied.
In this paper, we propose the BatchNeuralUCB algorithm, which uses neural networks for estimating rewards while keeping the total number of policy updates small. BatchNeuralUCB addresses both limitations described above: (1) it reduces the computational complexity of NeuralUCB, allowing its use in large-scale applications, and (2) it limits the number of policy updates, making it an excellent choice for settings that require limited adaptivity. It is worth noting that while the idea of limiting the number of updates for neural networks has been used in Riquelme et al. (2018); Xu et al. (2020) and in the experiments of Zhou et al. (2019), no formal results on the number of batches required or the optimal batch selection scheme were provided. Our main contributions can be summarized as follows.
• We propose BatchNeuralUCB which, in sharp contrast to NeuralUCB, updates its network parameters at most B times, where B is the number of batches. We propose two update schemes: the fixed batch scheme, where the batch grid is pre-fixed, and the adaptive batch scheme, where the selection of the batch grid can depend on previous contexts and observed rewards. When B = T, BatchNeuralUCB degenerates to NeuralUCB.
• We prove that for BatchNeuralUCB with the fixed batch scheme, the regret is bounded by Õ(d̃√T + d̃T/B), where d̃ is the effective dimension (see Definition 5.6). For the adaptive batch scheme, for any choice of q, the regret is bounded by Õ(√(max{q, (1 + TK/λ)^(d̃/B)}) · d̃√T), where q is the parameter that determines the adaptivity of our algorithm (see Algorithm 1 for details) and K is the number of arms. Therefore, to obtain the same regret as its fully sequential counterpart, BatchNeuralUCB only requires Ω(√T) batches for the fixed scheme and Ω(log T) for the adaptive scheme. These bounds match the lower bounds presented for batched linear bandits (Han et al., 2020) and rarely switching linear bandits (Ruan et al., 2020), respectively.
• We carry out numerical experiments on synthetic and real datasets to confirm our theoretical findings. In particular, these experiments demonstrate that in most configurations with fixed and adaptive schemes, the regret of the proposed BatchNeuralUCB remains close to the regret of the fully sequential NeuralUCB algorithm, while the number of policy updates as well as the running time are reduced by an order of magnitude.
Notations. We use lower case letters to denote scalars, and lower and upper case bold letters to denote vectors and matrices, respectively. We use ‖·‖ to denote the Euclidean norm, and for a positive semidefinite matrix Σ and any vector x, we write ‖x‖_Σ := ‖Σ^(1/2) x‖ = √(x^⊤ Σ x). We also use the standard O and Ω notations; the notation Õ hides logarithmic factors. Finally, we use the shorthand [n] to denote the set of integers {1, . . . , n}.

Related Work
The literature on the contextual multi-armed bandit problem is vast. Due to space limitations, we only review existing work on batched bandits and bandits with function approximation here, and refer the interested reader to the recent monographs by Slivkins et al. (2019) and Lattimore and Szepesvári (2020) for a thorough overview.
Batched Bandits. The design of batched multi-armed bandit models can be traced back to the UCB2 (Auer et al., 2002) and Improved-UCB (Auer and Ortner, 2010) algorithms, originally designed for the fully sequential setting. Perchet et al. (2016) provided the first systematic analysis of the batched stochastic multi-armed bandit problem and established near-optimal gap-dependent and gap-independent regret bounds for the case of two arms (K = 2). Gao et al. (2019) extended this analysis to the general setting of K > 2 and proved regret bounds for both adaptive and non-adaptive grids. Esfandiari et al. (2019) improved the gap-dependent regret bound for the stochastic case and provided lower and upper regret bounds for the adversarial case. They also established regret bounds for batched stochastic linear bandits.
Our work in the batched setting is most closely related to Han et al. (2020); Ruan et al. (2020). In particular, Han et al. (2020) studied the batched stochastic linear contextual bandit problem for both adversarial and stochastic contexts. For the case of adversarial contexts, they showed that the number of batches B should be at least Ω(√(dT)). Ruan et al. (2020) studied the batched contextual bandit problem using distributional optimal designs and extended the results of Han et al. (2020). They also studied the minimum adaptivity needed for rarely switching contextual bandit problems in both adversarial and stochastic context settings. In particular, for adversarial contexts, they proved a lower bound of Ω((d log T)/log(d log T)). Our work, however, differs from Han et al. (2020); Ruan et al. (2020) in that we do not require any linearity assumption on the reward functions; similar to NeuralUCB (Zhou et al., 2019), our regret analysis only requires the rewards to be bounded.
Bandits with Function Approximation. Given that Deep Neural Network (DNN) models enable the learner to use nonlinear models with less domain knowledge, Riquelme et al. (2018); Zahavy and Mannor (2019) studied neural-linear bandits. In particular, they used all but the last layer of a DNN as a feature map, which transforms contexts from the raw input space to a low-dimensional space, usually with a better representation and less frequent updates. They then learned a linear exploration policy on top of the last hidden layer of the DNN with more frequent updates. Even though these attempts achieved great empirical success, they do not provide any regret guarantees. Zhou et al. (2019) proposed the NeuralUCB algorithm, which uses neural networks to estimate reward functions while addressing the exploration-exploitation tradeoff using the upper confidence bound technique. Zhang et al. (2020) extended this analysis to Thompson sampling. Xu et al. (2020) proposed Neural-LinUCB, which shares the same spirit as neural-linear bandits, and proved an Õ(√T) regret bound.

Problem Setting
In this section, we present the technical details of our model and our problem setting.
Model. We consider the stochastic K-armed contextual bandit problem, where the total number of rounds T is known. At round t ∈ [T], the learner observes the context consisting of K feature vectors {x_{t,1}, . . . , x_{t,K}} ⊂ R^d and selects an action a_t ∈ [K]. For brevity, we denote the collection of all contexts {x_{1,1}, x_{1,2}, . . . , x_{T,K}} by {x_i}_{i=1}^{TK}.
Reward. Upon selecting action a_t, she receives a stochastic reward r_{t,a_t}. In this work, we make the following assumption about reward generation: for any round t,

r_{t,a_t} = h(x_{t,a_t}) + ξ_t,    (3.1)

where h is an unknown function satisfying 0 ≤ h(x) ≤ 1 for any x, and ξ_t is ν-sub-Gaussian noise conditioned on the history.
Goal. The learner wishes to minimize the following pseudo regret (or regret for short):

R_T = E[ Σ_{t=1}^T (r_{t,a*_t} − r_{t,a_t}) ],

where a*_t = argmax_{a∈[K]} E[r_{t,a}] is the optimal action at round t that maximizes the expected reward.
Reward Estimation. In order to learn the reward function h in Eq. (3.1), we propose to use a fully connected neural network with depth L ≥ 2:

f(x; θ) = √m W_L σ(W_{L−1} σ(· · · σ(W_1 x))),

where σ(x) = max{x, 0} is the rectified linear unit (ReLU) activation function and θ = (vec(W_1); . . . ; vec(W_L)) is the collection of network parameters. Without loss of generality, we assume that the width of each hidden layer is the same (i.e., m) for convenience in the analysis. We denote the gradient of the neural network function by g(x; θ) = ∇_θ f(x; θ).
Batch Setting. In this work, we consider the batched bandit setting in which the entire horizon of T rounds is divided into B batches. Formally, we define a grid T = {t_0, t_1, . . . , t_B} with 1 = t_0 < t_1 < · · · < t_B = T + 1. The learner selects her policy at the beginning of each batch and executes it for the entire batch. She observes all collected rewards at the end of the batch and then updates her policy for the next batch. The batch model consists of two specific schemes. In the fixed batch size scheme, the points in the grid T are pre-fixed and cannot be altered during the execution of the algorithm. In the adaptive batch size scheme, however, the beginning and end rounds of each batch are decided dynamically by the algorithm.
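The reward network described above can be sketched in a few lines. This is a minimal illustration assuming numpy; the function names are ours, and the Gaussian initialization here is a simplification rather than the symmetric initialization scheme the paper uses (Assumption 5.4).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_params(d, m, L, rng):
    """Random weights for an L-layer ReLU network of width m.
    Simplified Gaussian init, not the paper's exact symmetric scheme."""
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    for _ in range(L - 2):
        Ws.append(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)))
    Ws.append(rng.normal(0.0, np.sqrt(2.0 / m), size=(1, m)))
    return Ws

def f(x, Ws):
    """f(x; theta) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x))."""
    m = Ws[0].shape[0]
    h = x
    for W in Ws[:-1]:
        h = relu(W @ h)
    return float(np.sqrt(m) * (Ws[-1] @ h))
```

Note that with all-zero input the output is exactly zero, since every pre-activation vanishes; the paper's symmetric initialization instead guarantees f(x_i; θ_0) = 0 for the actual contexts.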

Algorithms
We propose our algorithm BatchNeuralUCB in this section. In essence, BatchNeuralUCB uses a neural network f(x; θ) to predict the reward of context x and upper confidence bounds computed from the network to guide exploration (Auer, 2002), similar to NeuralUCB (Zhou et al., 2019). The main difference is that BatchNeuralUCB does not update its parameter θ at each round. Instead, BatchNeuralUCB specifies either a fixed or adaptive batch grid T = {t_0, t_1, . . . , t_B}.
At the beginning of the b-th batch, the algorithm updates the parameter θ of the neural network to θ_b by optimizing a regularized squared loss over all observed contexts and rewards using gradient descent. The training procedure is described in Algorithm 2. Meanwhile, within each batch, BatchNeuralUCB maintains the covariance matrix Z_{t_b}, which is calculated over the gradients of the observed contexts, each taken with respect to the estimated parameter of the neural network at the beginning of that context's corresponding batch. Based on θ_b and Z_{t_b}, BatchNeuralUCB calculates the UCB estimate of the reward, f_b(·), as Line 7 of Algorithm 1 suggests. The function f_b(·) is used to select actions during the b-th batch. In particular, at round t, BatchNeuralUCB receives contexts {x_{t,a}}_{a=1}^K and picks a_t which maximizes the optimistic reward f_b(x_{t,a}) (see Line 10). Once the batch finishes, the rewards r_{t,a_t} collected during the batch are observed (Line 5), and the process continues.
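The per-round selection step can be sketched as follows, assuming numpy. The function names are ours; the prediction f(x; θ_b), the gradients g(x; θ_b), and the confidence radius β follow the text, with the optimistic score taking the usual form f(x; θ_b) + β √(g^⊤ Z^{−1} g / m).

```python
import numpy as np

def ucb_scores(grads, Z_inv, preds, beta, m):
    """Optimistic scores f(x; theta_b) + beta * sqrt(g^T Z^{-1} g / m)
    for each arm; grads has one gradient row per arm."""
    bonuses = np.sqrt(np.einsum('ap,pq,aq->a', grads, Z_inv, grads) / m)
    return preds + beta * bonuses

def select_arm(grads, Z_inv, preds, beta, m):
    """Pick the arm maximizing the UCB score."""
    return int(np.argmax(ucb_scores(grads, Z_inv, preds, beta, m)))

def update_cov(Z, grad, m):
    """Rank-one covariance update Z <- Z + g g^T / m for the chosen arm."""
    return Z + np.outer(grad, grad) / m
```

In the batched algorithm, Z_inv and preds are recomputed only at batch boundaries, while update_cov accumulates gradient outer products within the batch.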

Fixed Batch Size Scheme
For the fixed batch scheme, BatchNeuralUCB predefines the batch grid T = {t_0, t_1, . . . , t_B} as a deterministic set depending on the time horizon T and the number of batches B.
In particular, BatchNeuralUCB selects the simple uniform batch grid, with t_b = b · T/B + 1, as suggested in Eq. (4.1). It is easy to see that when B = T, BatchNeuralUCB updates the network parameters at every round, reducing to NeuralUCB. Han et al. (2020) also studied the fixed batch size scheme, but for linear rewards.
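The uniform grid above can be written out directly; a small sketch (assuming B divides T, so the grid points are integers):

```python
def fixed_batch_grid(T, B):
    """Uniform batch grid t_b = b * T/B + 1, b = 0, ..., B.
    Gives t_0 = 1 and t_B = T + 1, so batch b covers rounds [t_{b-1}, t_b)."""
    step = T // B  # assumes B divides T for simplicity
    return [b * step + 1 for b in range(B + 1)]
```

For T = 2000 and B = 100, this produces batch boundaries 1, 21, 41, . . . , 2001, i.e., 100 batches of 20 rounds each.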

Adaptive Batch Size Scheme
Unlike the fixed batch size scheme, in the adaptive batch size scheme BatchNeuralUCB does not predefine the batch grid. Instead, it dynamically selects the batch grid based on previous observations. Specifically, at any round t, the algorithm calculates the determinant of the covariance matrix and keeps track of its ratio to the determinant of the covariance matrix calculated at the end of the previous batch. If this ratio is larger than a hyperparameter q and the number of utilized batches is less than the budget B, then BatchNeuralUCB starts a new batch. This idea is similar to the rarely switching updating rule introduced in Abbasi-Yadkori et al. (2011) for linear bandits. The difference is that while Abbasi-Yadkori et al. (2011) apply this idea directly to the contexts {x_i}_i, Algorithm 1 applies it to the gradient mapping of the contexts.
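The determinant-ratio trigger can be sketched as below, assuming numpy; the function name is ours. Since log(q) is large in the experiments (e.g., log(q) = 20), comparing log-determinants via slogdet is the numerically sensible form of the test det(Z_t) > q · det(Z_{t_b}).

```python
import numpy as np

def should_start_new_batch(Z_t, Z_prev_batch, log_q, batches_used, B):
    """True when det(Z_t) > q * det(Z_{t_b}) and the batch budget B
    is not yet exhausted; done in log space for numerical stability."""
    _, logdet_t = np.linalg.slogdet(Z_t)
    _, logdet_prev = np.linalg.slogdet(Z_prev_batch)
    return batches_used < B and (logdet_t - logdet_prev) > log_q
```

Both matrices are positive definite by construction (they start at λI and accumulate outer products), so the slogdet signs are always +1 here.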

Main Results
In this section, we present our main theoretical results about Algorithm 1. First, we need the definition of the neural tangent kernel (NTK) matrix (Jacot et al., 2018).
Definition 5.1. Let {x_i}_{i=1}^{TK} be a set of contexts. Define, for i, j ∈ [TK],

H̃^(1)_{i,j} = Σ^(1)_{i,j} = x_i^⊤ x_j,
A^(l)_{i,j} = ( (Σ^(l)_{i,i}, Σ^(l)_{i,j}), (Σ^(l)_{j,i}, Σ^(l)_{j,j}) ),
Σ^(l+1)_{i,j} = 2 E_{(u,v)∼N(0, A^(l)_{i,j})}[σ(u) σ(v)],
H̃^(l+1)_{i,j} = 2 H̃^(l)_{i,j} E_{(u,v)∼N(0, A^(l)_{i,j})}[σ′(u) σ′(v)] + Σ^(l+1)_{i,j}.

Then, H = (H̃^(L) + Σ^(L))/2 is called the neural tangent kernel (NTK) matrix on the context set {x_i}_i. For simplicity, let h ∈ R^{TK} denote the vector (h(x_i))_{i=1}^{TK}. We need the following assumption on the NTK Gram matrix H.
Assumption 5.2. The NTK matrix satisfies H ⪰ λ_0 I for some λ_0 > 0.
Remark 5.3. Assumption 5.2 requires that the NTK matrix H is non-singular. Such a requirement can be guaranteed as long as no two contexts in {x_i}_i are parallel (Du et al., 2018).
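The Gaussian expectations in the NTK recursion can be approximated numerically by Monte Carlo sampling; the sketch below (assuming numpy, function name ours) follows the recursion of Definition 5.1 for ReLU networks rather than any closed form.

```python
import numpy as np

def ntk_matrix(X, L=2, n_mc=100000, seed=0):
    """Monte Carlo approximation of the NTK matrix H = (H^{(L)} + Sigma^{(L)})/2
    for ReLU activations, following the layerwise recursion."""
    rng = np.random.default_rng(seed)
    relu = lambda z: np.maximum(z, 0.0)
    n = X.shape[0]
    Sigma = X @ X.T      # Sigma^{(1)}
    H = Sigma.copy()     # tilde H^{(1)}
    for _ in range(L - 1):
        new_Sigma = np.zeros_like(Sigma)
        deriv = np.zeros_like(Sigma)
        for i in range(n):
            for j in range(n):
                A = np.array([[Sigma[i, i], Sigma[i, j]],
                              [Sigma[j, i], Sigma[j, j]]])
                # sample (u, v) ~ N(0, A); small jitter keeps Cholesky stable
                Lch = np.linalg.cholesky(A + 1e-9 * np.eye(2))
                uv = rng.standard_normal((n_mc, 2)) @ Lch.T
                new_Sigma[i, j] = 2.0 * np.mean(relu(uv[:, 0]) * relu(uv[:, 1]))
                deriv[i, j] = 2.0 * np.mean((uv[:, 0] > 0) & (uv[:, 1] > 0))
        H = H * deriv + new_Sigma
        Sigma = new_Sigma
    return (H + Sigma) / 2.0
```

For a single unit-norm context and L = 2, the recursion gives Σ^(2) = 1 and H̃^(2) = 2, so the NTK value is 1.5; the Monte Carlo estimate should land close to that.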
We also need the following assumption over the initialized parameter θ 0 and the contexts x i .
Assumption 5.4. For any 1 ≤ i ≤ TK, the context satisfies ‖x_i‖_2 = 1 and [x_i]_j = [x_i]_{j+d/2} for 1 ≤ j ≤ d/2. Moreover, the initial parameter θ_0 is generated as follows: for 1 ≤ l ≤ L − 1, W_l is set to ((W, 0), (0, W)), where each entry of W is generated independently from N(0, 4/m); W_L is set to (w^⊤, −w^⊤), where each entry of w is generated independently from N(0, 2/m).
Remark 5.5. Assumption 5.4 requires that the context x_i and the initial parameter θ_0 be 'symmetric' coordinate-wise. It can be verified that under this assumption, for any i ∈ [TK] we have f(x_i; θ_0) = 0, which is crucial to our analysis. Meanwhile, for any context x that does not satisfy the assumption, we can always construct a satisfying new context x′ by setting x′ = (x^⊤, x^⊤)^⊤/√2. We also need the definition of the effective dimension (Definition 5.6). The following two theorems characterize the regret bounds of BatchNeuralUCB under the two update schemes. We first present the regret bound of BatchNeuralUCB under the fixed batch size scheme.
Theorem 5.8. Suppose Assumptions 5.2 and 5.4 hold. Set m = poly(T, L, K, λ^{−1}, λ_0^{−1}, S^{−1}, log(1/δ)) and λ ≥ S^{−2}, where S is a parameter satisfying S ≥ √(2 h^⊤ H^{−1} h). There exist positive constants C_1, C_2, C_3 such that, for suitable choices of the step size η, the number of gradient descent iterations J, and the confidence radii {β_t}, with probability at least 1 − δ, the regret of Algorithm 1 with the fixed batch size scheme is bounded as follows:

R_T = Õ(d̃√T + d̃T/B).    (5.1)

Remark 5.9. Suppose h belongs to the RKHS space of the NTK kernel H with a finite RKHS norm ‖h‖_H; then S can be taken as a constant. Eq. (5.1) shows that to obtain an O(√T) regret, at least Ω(√T) batches are needed, which implies that our choice of B as √T is tight.
We have the following theorem for Algorithm 1 under the adaptive batch size scheme.
Theorem 5.11. Suppose Assumptions 5.2 and 5.4 hold. Let S, λ, J, η, {β_t} be selected as in Theorem 5.8. Then, with probability at least 1 − δ, the regret of Algorithm 1 with the adaptive batch size scheme can be bounded by

R_T = Õ(√(max{q, (1 + TK/λ)^(d̃/B)}) · d̃√T).

Remark 5.12. By treating ν as a constant, assuming that h belongs to the RKHS space of the NTK kernel H with a finite RKHS norm ‖h‖_H, and setting S and λ as Remark 5.9 suggests, the regret is bounded by Õ(√(max{q, (1 + TK/λ)^(d̃/B)}) · √T).
Remark 5.13. To achieve an Õ(d̃√T) regret, here B needs to be chosen as Ω(d̃ log(1 + TK/λ)) and q = Θ((1 + TK/λ)^(d̃/B)). As a comparison, for the linear bandit case, Ruan et al. (2020)

Numerical Experiments
In this section, we run numerical experiments to validate our theoretical findings. In what follows, we consider both real and synthetically generated data. Due to space limitations, we defer the discussion of hyperparameter tuning of the algorithms and additional simulations to Appendix A.
Figure 1: Distribution of per-instance regret on synthetic data. The solid and dashed lines indicate the median and the mean, respectively. Note that BNUCB stands for BatchNeuralUCB and that the regrets are plotted on a log scale.

Synthetic Data
We compare the performance of our proposed BatchNeuralUCB (BNUCB) algorithm with fixed and adaptive batch sizes, for several values of B and q, against two fully sequential benchmarks on synthetic data generated as follows. Let T = 2000, d = 10, K = 4, and consider the cosine reward function given by r_{t,a} = cos(3 x_{t,a}^⊤ θ*) + ξ_t, where {x_{t,1}, x_{t,2}, . . . , x_{t,K}} are contexts generated at time t according to U[0, 1]^d, independently of each other. The parameter θ* is the unknown parameter of the model, generated according to U[0, 1]^d and normalized to satisfy ‖θ*‖_2 = 1. The noise ξ_t is generated according to N(0, 0.25), independently of all other variables.
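The data-generating process above can be sketched as follows (assuming numpy; the function name is ours, and we read N(0, 0.25) as variance 0.25, i.e., standard deviation 0.5):

```python
import numpy as np

def make_cosine_bandit(T=2000, d=10, K=4, noise_sd=0.5, seed=0):
    """Contexts ~ U[0,1]^d; mean rewards cos(3 x^T theta*); Gaussian noise.
    theta* ~ U[0,1]^d, normalized to unit Euclidean norm."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, d)
    theta /= np.linalg.norm(theta)
    X = rng.uniform(0.0, 1.0, size=(T, K, d))      # one context per round and arm
    mean_rewards = np.cos(3.0 * X @ theta)
    rewards = mean_rewards + rng.normal(0.0, noise_sd, size=(T, K))
    return X, rewards, mean_rewards, theta
```

The mean rewards are bounded in [−1, 1], consistent with the boundedness the regret analysis requires (after a shift and rescale).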
The fully sequential benchmarks considered are: (1) the NeuralUCB algorithm (Zhou et al., 2019) and (2) the LinUCB algorithm (Li et al., 2010). For the BatchNeuralUCB and NeuralUCB algorithms, we consider two-layer neural networks with m = 200 hidden units. We repeat this process 10 times and generate the following plots: • Box plot of the total regret of the algorithms together with its standard deviation.
• Scatter plot of the total regret vs. execution time for 5 simulations randomly selected out of the 10 simulations.
Results. The results are depicted in Figures 1 and 3(a). We can make the following observations. First, the regret of LinUCB is almost 10 times worse than that of the fully sequential NeuralUCB, which is potentially due to model misspecification. Our proposed BNUCB performs well in both fixed and adaptive schemes, while keeping the total number of policy updates and the running time small. In fact, for all models except the fixed batch setting with B = 40, the regret of BNUCB is within a factor of two of its fully sequential counterpart. At the same time, the number of policy updates and execution times of all configurations of BNUCB, for all pairs (B, q), are almost ten times smaller than those of the fully sequential version. Second, for a given batch size B, the adaptive batch scheme configurations perform better than the fixed ones.

Real Data
We repeat the above simulation, this time using the Mushroom dataset from the UCI repository.1 The dataset is originally designed for classifying edible vs. poisonous mushrooms. It contains n = 8124 samples, each with d = 22 features and belonging to one of K = 2 classes. For each sample s_t = (c_t, l_t) with context c_t ∈ R^d and label l_t ∈ [K], we consider the zero-one reward defined as r_{t,a_t} = 1{a_t = l_t} and generate our context vectors {x_{t,a} ∈ R^{Kd}} from c_t. We compare the performance of the algorithms on metrics similar to those described in Section 6.2, over 10 random Monte Carlo simulations, and report the results. For each instance, we select T = 2000 random samples without replacement from the dataset and run all algorithms on that instance. Note that in this simulation, both NeuralUCB and BatchNeuralUCB use two-layer neural networks with m = 100 hidden units.
Results. The results are depicted in Figures 2 and 3(b). We can make the following observations. First, among the fully sequential models, NeuralUCB outperforms LinUCB, although it is 1000 times slower. As the number of batches used in BNUCB increases, the regret decreases and gets closer to that of the fully sequential NeuralUCB. For instance, in all adaptive batch scheme models with B = 250, the average regret is worse than that of the fully sequential NeuralUCB by only twenty percent, outperforming LinUCB. Furthermore, they keep the total number of policy changes limited and improve the running time of the fully sequential NeuralUCB by a factor of eight. Second, across these configurations, every adaptive batch model outperforms all fixed batch models. For example, the adaptive BNUCB with B = 40 and log(q) = 30 outperforms BNUCB with B = 250 fixed batches. This validates our theory that the minimum number of batches required for obtaining the optimal O(√T) regret is much smaller in the adaptive batch setting than in the fixed batch setting (of order log T vs. √T).
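Turning a K-class classification sample into K arm contexts in R^{Kd} is commonly done with a disjoint (block one-hot) encoding: arm a receives the raw features in its own block of d coordinates and zeros elsewhere. The sketch below (assuming numpy, name ours) shows this standard construction; the paper's exact encoding may differ.

```python
import numpy as np

def disjoint_contexts(c, K):
    """Map a raw feature vector c in R^d to K arm contexts in R^{Kd}:
    arm a carries c in its own block of d coordinates, zeros elsewhere."""
    d = len(c)
    X = np.zeros((K, K * d))
    for a in range(K):
        X[a, a * d:(a + 1) * d] = c
    return X
```

Under this encoding a linear model can learn a separate weight block per arm, and a neural model sees disjoint supports per arm.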

Proof of the Main Results
In this section, we prove Theorem 5.8 and Theorem 5.11.

Proof of Theorem 5.8
To prove Theorem 5.8, we need the following lemmas. The first lemma, from Zhou et al. (2019), suggests that at each round t within the b-th batch, the difference between the reward of the optimal action x_{t,a*_t} and that of the selected action x_{t,a_t} can be upper bounded by a bonus term defined through the confidence radius β_{t_b}, the gradient g(x_{t,a_t}; θ_b), and the covariance matrix Z_{t_b}.
The next lemma bounds the log-determinant of the covariance matrix Z_{T+1} by the effective dimension d̃.
where (a) holds since h(x_{t,a*_t}) − h(x_{t,a_t}) ≤ 1 and by Lemma 7.1, and (b) holds since b ∉ C and by Lemma 7.4. Hence, where (a) holds due to the Cauchy-Schwarz inequality and (b) holds due to Lemma 7.3. We can bound |C| as follows: where (a) holds since Z_{t_{b+1}} ⪰ Z_{t_b} and (b) holds due to the definition of C. Eq. (7.1) suggests that |C| ≤ log(det(Z_{T+1})/det(λI)). Therefore, finally, with a large enough m, by the selection of J and Lemma 7.2, we have log(det(Z_{T+1})/det(λI)) = Õ(d̃). We also have m^{−1/6} √(log m) ξ(T) T ≤ 1. Substituting these terms into Eq. (7.2), we complete the proof.

Proof of Theorem 5.11
Let B′ be the value of b when Algorithm 1 stops. It is easy to see that B′ ≤ B, so there are at most B batches. We first decompose the regret as follows, using Lemma 7.1. To bound Eq. (7.3), we consider the following two separate cases. First, if B′ < B, then for all 0 ≤ b ≤ B′ and t_b ≤ t < t_{b+1}, we have det(Z_t) ≤ q det(Z_{t_b}). Therefore, by Eq. (7.3) the regret can be bounded accordingly. Thus, substituting Eq. (7.7) into Eq. (7.6), we have
Results. The results are depicted in Figures 6 and 7. As can be observed, both fully sequential algorithms, i.e., LinUCB and NeuralUCB, perform relatively well, with NeuralUCB outperforming LinUCB by a slight margin. Across batch models, the adaptive ones perform much better than the fixed ones. For instance, BatchNeuralUCB with B = 40 batches chosen adaptively using log(q) = 50, 60, or 70 outperforms BatchNeuralUCB with B = 200 batches of fixed size. The average regret of both fixed and adaptive versions decreases as the number of batches increases. In particular, the regrets of the adaptive models with B = 100 and B = 200 are very close and indistinguishable from that of the fully sequential NeuralUCB algorithm, while taking almost 8 times less execution time.
The grid points t_0, t_1, . . . , t_B are the start and end rounds of the batches. Here, the interval [t_{b−1}, t_b) contains the rounds belonging to batch b ∈ [B].

Definition 5.6.
The effective dimension d̃ of the neural tangent kernel matrix on the contexts {x_i}_{i=1}^{TK} is defined as d̃ = log det(I + H/λ) / log(1 + TK/λ). Remark 5.7. The notion of effective dimension d̃ is similar to the information gain introduced in Srinivas et al. (2009) and the effective dimension introduced in Valko et al. (2013). Intuitively, d̃ measures how quickly the eigenvalues of H diminish, and it is upper bounded by the dimension of the RKHS space spanned by H (Zhou et al., 2019).
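Definition 5.6 is directly computable for a given kernel matrix; a small sketch assuming numpy (function name ours), using slogdet to evaluate the log-determinant stably:

```python
import numpy as np

def effective_dimension(H, T, K, lam=1.0):
    """d_tilde = log det(I + H/lam) / log(1 + T*K/lam), per Definition 5.6."""
    n = H.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + H / lam)
    return logdet / np.log(1.0 + T * K / lam)
```

For instance, with H = I (two contexts), λ = 1, and TK = 2, the value is 2 log 2 / log 3 ≈ 1.26, reflecting two equally significant eigendirections measured on a log(1 + TK/λ) scale.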
has shown that an O(d log d log T) number of batches is necessary to achieve an O(√T) regret. Therefore, our choice of B as d̃ log T is tight up to a log d factor.

Figure 2: Distribution of per-instance regret on the Mushroom dataset. The solid and dashed lines indicate the median and the mean, respectively. Note that BNUCB stands for BatchNeuralUCB.

Figure 3: Scatter plot of regret vs. execution time for 5 instances selected at random on synthetic data (left) and Mushroom data (right).
(7.4) where (a) holds due to Lemma 7.4, (b) holds due to the Cauchy-Schwarz inequality, and (c) holds due to Lemma 7.3. Second, if B′ = B, then for all 0 ≤ b ≤ B − 1 and t_b ≤ t < t_{b+1}, we proceed as in Eq. (7.5), where (a) holds due to Lemma 7.4 and the following two facts: det(Z_t) ≤ q det(Z_{t_b}) for all 0 ≤ b ≤ B − 1, t_b ≤ t < t_{b+1}; and Eq. (7.5) for b = B. Moreover, (b) holds due to the Cauchy-Schwarz inequality and (c) holds due to Lemma 7.3. Combining Eqs. (7.4) and (7.6), we conclude that under both the B′ < B and B′ = B cases, Eq. (7.6) holds. Finally, by Lemma 7.2 and the selection of J and m, we have det(Z_{T+1})/det(λI) = O((1

Figure 4: Distribution of per-instance regret on synthetic data with quadratic reward. The solid and dashed lines indicate the median and the mean, respectively. Note that BNUCB stands for BatchNeuralUCB.

Figure 5: Scatter plot of regret vs. execution time for 5 instances selected at random on synthetic data with quadratic reward.

Figure 7: Scatter plot of regret vs. execution time for 5 instances selected at random on the Magic Telescope dataset.
Proof of Lemma 7.2. By Eq. (B.19) in Zhou et al. (2019), we can obtain our statement when m is large enough.
Lemma 7.3 (Lemma 11, Abbasi-Yadkori et al. 2011). For any {x_t}_{t=1}^T ⊂ R^d satisfying ‖x_t‖_2 ≤ L, let A_0 = λI and A_t = A_0 + Σ_{i=1}^{t−1} x_i x_i^⊤; then we have Σ_{t=1}^T min{1, ‖x_t‖²_{A_t^{−1}}} ≤ 2 log(det(A_{T+1})/det(λI)).
Lemma 7.4 (Lemma 12, Abbasi-Yadkori et al. 2011). Suppose A, B ∈ R^{d×d} are two positive definite matrices satisfying A ⪰ B; then, for any x ∈ R^d, ‖x‖_A ≤ ‖x‖_B · √(det(A)/det(B)).
Proof of Theorem 5.8. Define the set C as follows: