Meta Clustering of Neural Bandits

The contextual bandit has been identified as a powerful framework for formulating the recommendation process as a sequential decision-making process, where each item is regarded as an arm and the objective is to minimize the cumulative regret over $T$ rounds. In this paper, we study a new problem, Clustering of Neural Bandits, which extends previous work to arbitrary reward functions in order to strike a balance between user heterogeneity and user correlations in recommender systems. To solve this problem, we propose a novel algorithm called M-CNB, which utilizes a meta-learner to represent and rapidly adapt to dynamic clusters, along with an informative Upper Confidence Bound (UCB)-based exploration strategy. We provide an instance-dependent performance guarantee for the proposed algorithm that withstands adversarial contexts, and we further prove that the guarantee is at least as good as state-of-the-art (SOTA) approaches under the same assumptions. In extensive experiments conducted in both recommendation and online classification scenarios, M-CNB outperforms SOTA baselines, demonstrating the effectiveness of the proposed approach in improving online recommendation and online classification performance.


Introduction
Recommender systems play an integral role in various online businesses, including e-commerce platforms and online streaming services. They leverage user correlations to assist the perception of user preferences, a field of study spanning several decades. In the past, considerable effort has been directed toward supervised-learning-based collaborative filtering methods within relatively static environments [26,53]. However, the ideal recommender system should adapt over time to consistently meet user interests. Consequently, it is natural to formulate the recommendation process as a sequential decision-making process. In this paradigm, the recommender engages with users, observes their online feedback (i.e., rewards), and optimizes the user experience for long-term benefits, rather than fitting a model on collected static data based on supervised learning [13,22,62]. Based on this idea, this paper focuses on the formulation of contextual bandits, where each item is treated as an arm (context) in a recommendation round, and the primary objective is to minimize the cumulative regret over $T$ rounds while tackling the dilemma of exploitation and exploration in the sequential decision-making process [1,3,4,8,23,24,39,40,44,47,48].
Linear contextual bandits model a user's preference through a linear reward function based on arm contexts [1,16,38]. However, given the substantial growth of users in recommender systems, it can be overly ambitious to represent all user preferences with a single reward function, and modeling each user as a single bandit may overlook user correlations. To address this challenge, a series of methods known as clustering of linear bandits [4,23,24,39,40] have emerged, which represent each cluster of users by a reward function, achieving a balance between user heterogeneity and user correlations. Note that the cluster information is unknown in this problem setting. In essence, with each user treated as a linear contextual bandit, these methods adopt graph-based techniques to dynamically cluster users, and leverage user correlations for making arm recommendations. However, it is crucial to acknowledge the limitation of this line of work: they all rely on linear reward functions, and user clusters are represented as linear combinations of individual bandit parameters. The assumptions of linearity in reward functions and the linear representation of clusters may not hold up well in real-world applications [54,67].
To relax the assumption on reward mapping functions, inspired by recent advances in the single neural bandit [66,67], where a neural network is assigned to learn an unknown reward function, we study the new problem of Clustering of Neural Bandits (CNB) in this paper. Different from the single neural bandit [66,67] and clustering of linear bandits [4,23,24,39,40], CNB introduces bandit clusters built upon arbitrary reward functions, which can be either linear or non-linear. Meanwhile, we note that the underlying clusters are usually not static over specific arm contexts [39]. For example, in the personalized recommendation task, two users (bandits) may both like "country music" but have different opinions on "rock music". Therefore, adapting to arm-specific "relative clusters" in a dynamic environment is one of the main challenges in this problem. We propose a novel algorithm, Meta Clustering of Neural Bandits (M-CNB), to solve the CNB problem. Next, we summarize our key ideas and contributions.
Methodology. To address the CNB problem, we must confront three key challenges. (1) Efficiently determining a user's relative group: our approach employs a neural network, named the "user learner," to estimate each user's preferences. By grouping users with similar preferences, we efficiently create clusters with a process taking $O(n)$ time, where $n$ is the number of bandits (users). (2) Effective parametric representation of dynamic clusters: inspired by advancements in meta-learning [21,64], we introduce a meta-learner capable of representing and swiftly adapting to evolving clusters. In each round $t$, the meta-learner leverages the knowledge perceived in prior rounds $\{1, \dots, t-1\}$ to rapidly adapt to new clusters via a few samples. This enables the rapid acquisition of non-linear cluster representations, marking our first main contribution. (3) Balancing exploitation and exploration with relative bandit clusters: our second main contribution is an informative UCB-type exploration strategy, which takes into account both user-side and meta-side information to balance exploration and exploitation. By addressing these three main challenges, our approach solves the CNB problem effectively and efficiently.
Theoretical analysis. To obtain a regret upper bound for the proposed algorithm, we need to tackle the following three challenges. (1) Analyzing a neural meta-learner in the bandit framework: to complete the analysis, we must build a confidence ellipsoid for the meta-learner approximation, which is one of the main research gaps. To close this gap, we bridge the meta-learner and the user-learner via Neural Tangent Kernel (NTK) regression and build the confidence ellipsoid upon the user-learner, which allows us to achieve a more comprehensive understanding of the meta-learner's behavior. (2) Deriving a cluster-dependent regret bound: our bound scales with the expected number of clusters $q$ rather than with the number of users, which indicates that the proposed algorithm can leverage the collaborative effects among users. (3) Adversarial attack on contexts: in most neural bandit works, a common assumption is that the NTK matrix is non-singular, requiring that no two observed contexts (items) be identical or parallel [66,67]. This vulnerability makes their regret analysis susceptible to adversarial attacks and less practical in real-world scenarios. In the face of this challenge, we provide an instance-dependent regret analysis that withstands the context attack and allows contexts to be repeatedly observed. Furthermore, under the same assumptions as in existing works, we demonstrate that our regret upper bound is at least as good as SOTA approaches. These efforts to address the challenges in the theoretical analysis constitute our third main contribution.
Evaluations. We evaluate the proposed algorithm in two scenarios: online recommendation and online classification with bandit feedback. For the first scenario, which naturally lends itself to CNB, we assess the algorithm's performance on four recommendation datasets. Since online classification has been widely used to evaluate neural bandits [6,66,67], we also evaluate the algorithms on eight classification datasets, where each class can be considered a bandit (user), and correlations among classes are expected to be exploited. We compare the proposed algorithm with 8 strong baselines and show its superior performance. Additionally, we offer an empirical analysis of the algorithm's time complexity and conduct extensive sensitivity studies to investigate the impact of critical hyperparameters. The above empirical evaluation is our fourth main contribution.
Next, a detailed discussion of related works is placed in Section 2. After introducing the problem definition in Section 3, we present the proposed algorithm, M-CNB, in Section 4, together with the theoretical analysis in Section 5. Then, we provide the experimental results in Section 6 and conclude the paper in Section 7.

Related Work
In this section, we briefly review the related work, including clustering of bandits and neural bandits.
Clustering of bandits. CLUB [23] first studies collaborative effects among users in contextual bandits, where each user hosts an unknown vector representing the behavior based on a linear reward function. CLUB formulates user similarity on an evolving graph and selects an arm leveraging the clustered groups. Then, Gentile et al. [24] and Li et al. [39] propose to cluster users based on specific contents and select arms leveraging the aggregated information of conditioned groups. Li et al. [40] improve the clustering procedure by allowing groups to split and merge. Ban and He [4] use seed-based local clustering to find overlapping groups, different from global clustering on graphs. Korda et al. [33], Liu et al. [42], Wang et al. [58], Wu et al. [59], and Yang et al. [63] also study clustering of bandits under various settings in recommender systems. However, all these works are based on the linear reward assumption, which may fail in many real-world applications.
Neural bandits. Lipton et al. [41] and Riquelme et al. [50] adapt Thompson Sampling (TS) to the last layer of deep neural networks to select an action. However, these approaches do not provide regret analysis. Zhou et al. [67] and Zhang et al. [66] first provide the regret analysis of UCB-based and TS-based neural bandits, where they apply ridge regression in the space of gradients. Ban et al. [5] study a multi-facet bandit problem with UCB-based exploration. Jia et al. [30] perturb the training samples to incorporate both exploitation and exploration. EE-Net [6,9] proposes to use another neural network for exploration, with applications to active learning [7,10] and meta-learning [49]. [61] combines the last-layer neural network embedding with linear UCB to improve computational efficiency. Dutta et al. [20] use an off-the-shelf meta-learning approach to solve the contextual bandit problem, in which the expected reward is formulated as a Q-function. Santana et al. [51] propose a hierarchical reinforcement learning framework for recommendation in dynamic experiments, where a meta-bandit is used for the selected independent recommender system. Kassraie and Krause [31] revisit Neural-UCB-type algorithms and show an $O(\sqrt{T})$ regret bound without the restrictive assumptions on the context. Hong et al. [27] and Maillard and Mannor [43] study the latent bandit problem, where the reward distributions of arms are conditioned on some unknown discrete latent state, and prove an $O(\sqrt{T})$ regret bound for their algorithms as well. Federated bandits [15] consider dealing with multiple bandits (agents) while preserving the privacy of each bandit. Deb et al. [17] reduce contextual bandits to neural online regression for a tighter regret upper bound. Qi et al. [48] propose to use a graph to formulate user correlations, with the adoption of graph neural networks. However, the above works either focus on different problem settings or overlook the clustering of bandits.
Other related works. [35,52] study meta-learning in Thompson sampling, and Hong et al. [28] and Wan et al. [55] aim to exploit the hierarchical knowledge among hierarchical Bayesian bandits. However, these works focus on Bayesian or non-contextual bandits.

Problem: Clustering of Neural Bandits
In this section, we introduce the CNB problem, motivated by learning correlations among bandits with arbitrary reward functions. We use the scenario of personalized recommendation to state the problem setting.
Suppose there are $n$ users (bandits), $N = \{1, \dots, n\}$, to serve on a platform. In the $t$-th round, the platform receives a user $u_t \in N$ (the unique ID for this user) and prepares the corresponding $k$ candidate arms $X_t = \{\mathbf{x}_{t,1}, \mathbf{x}_{t,2}, \dots, \mathbf{x}_{t,k}\}$. Each arm is represented by its $d$-dimensional feature vector $\mathbf{x}_{t,i} \in \mathbb{R}^d$, $i \in [k] = \{1, \dots, k\}$, which encodes the information from both the user side and the arm side [38]. Then, the learner is expected to select an arm $\mathbf{x}_t \in X_t$ and recommend it to $u_t$, where $u_t$ refers to the target or served user. In response to this action, $u_t$ will provide the platform with a corresponding reward (feedback) $r_t$. Here, since different users may generate different rewards toward the same arm, we use $r_{t,i} \mid u_t$ to represent the reward produced by $u_t$ given $\mathbf{x}_{t,i}$. The formal definition of the arm reward is given below.
Given $u_t \in N$, the reward $r_{t,i}$ for each candidate arm $\mathbf{x}_{t,i} \in X_t$ is assumed to be governed by an unknown function: $r_{t,i} \mid u_t = h_{u_t}(\mathbf{x}_{t,i}) + \epsilon_{t,i}$, where $h_{u_t}$ is an unknown reward function associated with $u_t$, and it can be either linear or non-linear; $\epsilon_{t,i}$ is a noise term with zero expectation, $\mathbb{E}[\epsilon_{t,i}] = 0$. We also assume the reward $r_{t,i} \in [0,1]$ is bounded, as in many existing works [4,23,24]. Note that previous works on clustering of linear bandits all assume $h_{u_t}$ is a linear function with respect to the arm $\mathbf{x}_{t,i}$ [4,23,24,39,40]. Meanwhile, users may exhibit clustering behavior. Inspired by [24,39], we consider the cluster behavior to be item-varying, i.e., users who have the same preference on a certain item may have different opinions on another item. Therefore, we formulate a set of users with the same opinions on a certain item as a relative cluster, with the following definition.

Definition 3.1 (Relative Cluster). In round $t$, given an arm $\mathbf{x}_{t,i} \in X_t$, a relative cluster $\mathcal{N}(\mathbf{x}_{t,i}) \subseteq N$ with respect to $\mathbf{x}_{t,i}$ satisfies (1) $h_u(\mathbf{x}_{t,i}) = h_{u'}(\mathbf{x}_{t,i})$ for any $u, u' \in \mathcal{N}(\mathbf{x}_{t,i})$; and (2) there exists no $\mathcal{N}' \subseteq N$ such that $\mathcal{N}'$ satisfies (1) and $\mathcal{N}(\mathbf{x}_{t,i}) \subset \mathcal{N}'$.
Condition (2) guarantees that no other cluster contains $\mathcal{N}(\mathbf{x}_{t,i})$. This cluster definition allows users to agree on certain items while disagreeing on others, which is consistent with real-world scenarios. Since users from different clusters are expected to exhibit distinct behavior with respect to $\mathbf{x}_{t,i}$, we impose the following constraint among relative clusters.

Definition 3.2 ($\gamma$-gap). Given two different clusters $\mathcal{N}(\mathbf{x}_{t,i})$ and $\mathcal{N}'(\mathbf{x}_{t,i})$, there exists a constant $\gamma > 0$ such that $|h_u(\mathbf{x}_{t,i}) - h_{u'}(\mathbf{x}_{t,i})| \geq \gamma$ for any $u \in \mathcal{N}(\mathbf{x}_{t,i})$ and $u' \in \mathcal{N}'(\mathbf{x}_{t,i})$.

For any two clusters in $N$, we assume that they satisfy the gap constraint. Note that such an assumption is standard in the literature on online clustering of bandits to differentiate clusters [4,23,24,39,40]. As a result, given an arm $\mathbf{x}_{t,i}$, the bandit pool $N$ can be divided into $q_{t,i}$ non-overlapping clusters $\mathcal{N}_1(\mathbf{x}_{t,i}), \mathcal{N}_2(\mathbf{x}_{t,i}), \dots, \mathcal{N}_{q_{t,i}}(\mathbf{x}_{t,i})$, where $q_{t,i} \ll n$. Note that the cluster information is unknown to the platform.
For the CNB problem, the goal of the learner is to minimize the pseudo-regret of $T$ rounds: $R_T = \sum_{t=1}^{T} \mathbb{E}[r_t^* - r_t \mid u_t, X_t]$, where $r_t$ is the reward received in round $t$ and $\mathbb{E}[r_t^* \mid u_t, X_t] = \max_{\mathbf{x}_{t,i} \in X_t} h_{u_t}(\mathbf{x}_{t,i})$.
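The pseudo-regret above is what a simulation tracks round by round. A minimal sketch in plain Python, assuming hypothetical per-round logs `optimal_rewards` ($r_t^*$) and `received_rewards` ($r_t$) are available, as they are in the experiments of Section 6:

```python
# Hedged sketch: cumulative pseudo-regret R_T = sum_t (r_t^* - r_t),
# computed incrementally so a regret curve can be plotted over rounds.
def cumulative_regret(optimal_rewards, received_rewards):
    regret = 0.0
    curve = []
    for r_star, r in zip(optimal_rewards, received_rewards):
        regret += r_star - r   # per-round regret contribution
        curve.append(regret)
    return curve

# With binary rewards (as in the experiments), each wrong pick adds 1.
print(cumulative_regret([1, 1, 1], [0, 1, 0]))  # -> [1.0, 1.0, 2.0]
```

A sublinear curve (regret growing slower than the round index) is the behavior the bandit algorithms in this paper aim for.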
Notations. Let $\mathbf{x}_t$ be the arm selected in round $t$ and $r_t$ be the corresponding reward received in round $t$. We use $\|\mathbf{x}_t\|_2$ to represent the Euclidean norm. For each user $u \in N$, let $t_u$ be the number of rounds that user $u$'s learner has been served up to round $t$, and $\mathcal{T}_{u,t}$ be all of $u$'s historical data up to round $t$. $m$ is the width and $L$ is the depth of the neural networks in the proposed approach. Given a group $\mathcal{N}$, all its data up to round $t$ can be denoted by $\{\mathcal{T}_{u,t}\}_{u \in \mathcal{N}}$. We use standard $O$ and $\Omega$ notation to hide constants.

Proposed Algorithm
In this section, we present our proposed algorithm, denoted as M-CNB, to address the formulated CNB problem.M-CNB leverages the potential correlations among bandits, and aims to rapidly acquire a representation for dynamic relative clusters.
For M-CNB, we utilize a meta-learner, denoted by $\Theta$, to rapidly adapt to clusters as well as to represent the behavior of a cluster. Additionally, there are $n$ user-learners, denoted by $\{\theta^u\}_{u \in N}$, responsible for learning the preference $h_u(\cdot)$ for each user $u \in N$. In terms of the workflow, the primary role of the meta-learner is to determine recommended arms, while the user-learners are primarily utilized for clustering purposes. The meta-learner and user-learners share the same neural network structure, denoted by $f$. The workflow of M-CNB is divided into three main components: user clustering, meta adaptation, and UCB-based selection. We now elaborate on their details.
User clustering. Recall from Section 3 that each user $u \in N$ is governed by an unknown function $h_u$. We use a neural network $f(\cdot; \theta^u)$ to estimate $h_u$. In round $t \in [T]$, let $u_t$ be the user to serve. Given $u_t$'s past data up to round $t-1$, i.e., $\mathcal{T}_{u_t, t-1}$, we train the parameters $\theta^{u_t}$ by minimizing the squared loss over $\mathcal{T}_{u_t, t-1}$ via stochastic gradient descent (SGD). Therefore, for each $u \in N$, we obtain the trained parameters $\theta^u_{t-1}$. Then, given $u_t$ and an arm $\mathbf{x}_{t,i}$, we return $u_t$'s estimated cluster with respect to $\mathbf{x}_{t,i}$ by Eq. (3), where $\nu \in (0,1)$ represents the assumed $\gamma$-gap and $\gamma > 1$ is a tuning parameter for the exploration of cluster members.
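The clustering step can be sketched as follows. Here `predictions` stands in for the user-learners' estimates on the current arm, and the exact way the gap parameter and the tuning parameter combine into a threshold is an assumption for illustration (the paper's Eq. (3) gives the precise rule):

```python
# Hedged sketch of user clustering: group users whose user-learner
# estimates on the current arm are close to the served user's estimate.
# The threshold form `nu * gamma` is an illustrative assumption.
def estimate_cluster(predictions, serving_user, nu=0.4, gamma=5.0):
    """predictions: dict user_id -> estimated reward f(x; theta_u)."""
    center = predictions[serving_user]
    threshold = nu * gamma  # assumed combination of gap nu and tuning gamma
    return {u for u, p in predictions.items() if abs(p - center) <= threshold}
```

A single pass over the $n$ predictions realizes the $O(n)$ clustering cost stated in the introduction.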
Meta adaptation. We employ one meta-learner $\Theta$ to represent and adapt to the behavior of dynamic clusters. In meta-learning, the meta-learner is trained on a number of different tasks and can quickly adapt to new tasks with a small amount of new data [21]. Here, we consider a cluster $\widehat{\mathcal{N}}_{u_t}(\mathbf{x}_{t,i})$ as a task and its collected data as the task distribution. As a result, M-CNB has two adaptation phases: meta adaptation and user adaptation.
Meta adaptation. In the $t$-th round, given a cluster $\widehat{\mathcal{N}}_{u_t}(\mathbf{x}_{t,i})$, we have the available "task distributions" $\{\mathcal{T}_{u,t-1}\}_{u \in \widehat{\mathcal{N}}_{u_t}(\mathbf{x}_{t,i})}$. The goal of the meta-learner is to quickly adapt to the bandit cluster. Thus, we randomly draw a few samples from $\{\mathcal{T}_{u,t-1}\}_{u \in \widehat{\mathcal{N}}_{u_t}(\mathbf{x}_{t,i})}$ and update $\Theta$ in round $t$ using SGD, denoted by $\Theta_{t,i}$, based on $\Theta_{t-1}$, which is continuously trained on the collected interactions to incorporate the knowledge of the past $t-1$ rounds. The workflow is described in Figure 1 and Algorithm 2.
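The meta-adaptation step can be sketched as below, with a linear model standing in for the neural meta-learner $f(\cdot; \Theta)$ and `cluster_history` standing in for the pooled data of the detected cluster (both simplifying assumptions for illustration):

```python
import numpy as np

# Hedged sketch of meta adaptation: starting from the running meta
# parameters, take a few SGD steps on samples drawn from the cluster's
# collected (context, reward) history.
def meta_adapt(theta_meta, cluster_history, lr=0.01, steps=10, batch=4, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta_meta, dtype=float)
    X = np.array([x for x, _ in cluster_history])  # pooled contexts
    r = np.array([y for _, y in cluster_history])  # pooled rewards
    for _ in range(steps):
        idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
        residual = X[idx] @ theta - r[idx]
        theta -= lr * X[idx].T @ residual / len(idx)  # squared-loss SGD step
    return theta
```

The key design point survives the simplification: adaptation starts from $\Theta_{t-1}$ rather than from scratch, so a few gradient steps on a few cluster samples suffice.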
User adaptation. In the $t$-th round, given $u_t$, after receiving the reward $r_t$, we have the new data point $(\mathbf{x}_t, r_t)$. Then, the user learner $\theta^{u_t}$ is updated in round $t$ to refine its clustering capability, denoted by $\theta^{u_t}_t$. As the users in a cluster share the same or similar preferences on a certain item, we update all the user-learners in this cluster, as described in Algorithm 1, Lines 14-18.
Note that works on clustering of linear bandits [4,23,24,39,40] represent the cluster behavior $\Theta$ by a linear combination of the individual bandit parameters. This can lead to limited representation power of the cluster learner, and their linear reward assumptions may not hold in real-world settings [67]. Instead, we use meta adaptation to update the meta-learner $\Theta_{t-1}$ according to $\widehat{\mathcal{N}}_{u_t}(\mathbf{x}_{t,i})$, which can represent non-linear combinations of user-learners [21,56].
UCB-based Exploration. To balance the trade-off between the exploitation of the currently available information and the exploration of new matches, we introduce the following UCB-based selection criterion. Based on Lemma A.14, the cumulative error induced by the meta-learner is controlled by an upper confidence bound with two components: a meta-side term built on $\nabla_\Theta f(\mathbf{x}_t; \Theta_t)$, which incorporates the discriminative information the meta-learner acquires from the correlations within the relative cluster, and a user-side term that reflects the shrinking confidence interval of the user-learner for a specific user $u$. Then, we select an arm according to $\mathbf{x}_t = \arg\max_{\mathbf{x}_{t,i} \in X_t} U_{t,i}$, where $U_{t,i}$ is calculated in Line 9. In summary, Algorithm 1 depicts the workflow of M-CNB. In each round $t$, given a target user and a pool of candidate arms, we compute the meta-learner and its bound for each relative cluster (Lines 6-10). Then, we choose the arm according to the UCB-type strategy (Line 11). After receiving the reward, we update the user-learners; note that the meta-learner has already been updated in Line 8.
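The selection rule can be sketched as follows. The exact bonus form follows the paper's confidence-bound lemma; the additive combination of the two bonus terms and the weight `alpha` are illustrative assumptions:

```python
# Hedged sketch of the UCB-type arm selection (Line 11 of Algorithm 1):
# exploitation is the meta-learner's predicted reward; exploration adds
# a bonus combining a meta-side term (from the meta-learner's gradient)
# and a user-side confidence term. The weighting `alpha` is illustrative.
def select_arm(arms, meta_pred, meta_bonus, user_bonus, alpha=0.01):
    scores = [meta_pred[i] + alpha * (meta_bonus[i] + user_bonus[i])
              for i in range(len(arms))]
    best = max(range(len(arms)), key=scores.__getitem__)
    return arms[best]
```

With a small `alpha`, the rule mostly exploits the meta-learner's estimate; arms whose confidence terms are still wide can nevertheless win the argmax, which is what drives exploration.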
Then, we discuss the time complexity of Algorithm 1. With $n$ being the number of users, M-CNB takes $O(n)$ to find the cluster for the served user. Given the detected cluster $\widehat{\mathcal{N}}$, it takes $O(|\widehat{\mathcal{N}}|)$ to update the meta-learner by SGD. Suppose $\mathbb{E}[|\widehat{\mathcal{N}}|] = n/\bar{q}$ and $n/\bar{q} \ll n$. Therefore, the overall test time complexity of Algorithm 1 is $O(T(n + n/\bar{q}))$. To scale M-CNB for deployment in large recommender systems, we can rely on the assistance of pre-processing tools: pre-clustering of users and pre-selection of items. On the one hand, we can pre-cluster users based on user features or other information and then let a pre-cluster (instead of a single user) hold a neural network, which significantly reduces $n$. On the other hand, we can pre-select items based on item and user features to substantially reduce $k$; for instance, we only consider the restaurants near the serving user for a restaurant recommendation task. Furthermore, we can control the magnitude of $n/\bar{q}$ by tuning the hyperparameter $\gamma$ based on the actual application scenario. Consequently, M-CNB can effectively serve as a core component of large-scale recommender systems.

Regret Analysis
In this section, we provide the performance guarantee of M-CNB, which is built in the over-parameterized neural networks regime.
As is standard in contextual bandits, all arms are normalized to unit length. Given an arm $\mathbf{x}_{t,i} \in \mathbb{R}^d$ with $\|\mathbf{x}_{t,i}\|_2 = 1$, $t \in [T]$, $i \in [k]$, without loss of generality, we define $f$ as a fully-connected network with depth $L \geq 2$ and width $m$: $f(\mathbf{x}; \theta) = \mathbf{W}_L \, \sigma(\mathbf{W}_{L-1} \, \sigma(\cdots \sigma(\mathbf{W}_1 \mathbf{x})))$, where $\sigma$ is the ReLU activation function. Note that our analysis results can also be readily generalized to other neural architectures such as CNNs and ResNets [2,18]. Then, we employ the following initialization [12] for $\theta$ and $\Theta$: for $l \in [L-1]$, each entry of $\mathbf{W}_l$ is drawn from the normal distribution $N(0, 2/m)$; each entry of $\mathbf{W}_L$ is drawn from the normal distribution $N(0, 1/m)$.
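The network and initialization just described can be sketched as follows. This is an illustrative sketch, not the exact analyzed parameterization (in particular, the scaling factors some NTK formulations attach to the output layer are omitted):

```python
import numpy as np

# Hedged sketch of the L-layer fully-connected ReLU network with the
# initialization above: hidden entries ~ N(0, 2/m), output ~ N(0, 1/m).
def init_network(d, m, L, seed=0):
    rng = np.random.default_rng(seed)
    widths = [d] + [m] * (L - 1)
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(widths[l + 1], widths[l]))
          for l in range(L - 1)]                                # W_1..W_{L-1}
    Ws.append(rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m)))   # W_L
    return Ws

def forward(Ws, x):
    h = np.asarray(x, dtype=float)
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)   # ReLU activation
    return (Ws[-1] @ h).item()       # scalar reward estimate
```

Both the meta-learner and the user-learners instantiate this same structure $f$, differing only in their parameters.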
Here, given $R > 0$, we define the function class $B(\theta_0, R) = \{\theta : \|\theta - \theta_0\|_2 \leq R\}$, a ball centered at the random initialization point $\theta_0$ with radius $R$. This definition was originally introduced in the context of analyzing over-parameterized neural networks, and it can be found in the works of [12] and [2]. Recall that $q_{t,i}$ represents the number of clusters given $\mathbf{x}_{t,i}$. For simplicity of analysis, we let $\{(\mathbf{x}_t, r_t)\}_{t=1}^{T}$ represent all the data in $T$ rounds and define the squared loss $\mathcal{L}_t(\theta) = (f(\mathbf{x}_t; \theta) - r_t)^2/2$. Then, we provide the instance-dependent regret upper bound for M-CNB in the following theorem.
Theorem 5.1. Given the number of rounds $T$, for any $\delta \in (0,1)$ and $R > 0$, suppose the width satisfies $m \geq \Omega(\text{poly}(T, L, R))$. Then, with probability at least $1 - \delta$ over the initialization, Algorithm 1 achieves the stated regret upper bound.

Theorem 5.1 provides a regret bound for M-CNB, which consists of two main terms. The first term is instance-dependent and relates to the squared error achieved by the function class $B(\theta_0, R)$ on the data. The second term is a standard large-deviation error term.
Theorem 5.1 has several noteworthy properties. One important aspect is that it depends on the parameter $q$, which represents the expected number of clusters, rather than on the number of users $n$. Specifically, $O(\sqrt{T})$ corresponds to the regret cost of learning a single bandit, and thus $O(\sqrt{nT})$ is an estimate of the regret cost of learning $n$ bandits. However, Theorem 5.1 refines this naive bound to $O(\sqrt{qT})$, linking the regret cost to the actual underlying clusters among users.
Another advantage of Theorem 5.1 is that it makes no assumptions about the contexts $\{\mathbf{x}_t\}_{t=1}^{T}$ used in the problem. This makes Theorem 5.1 robust against adversarial attacks on the contexts and allows the observed contexts to contain repeated items. In contrast, existing neural bandit algorithms like [31,66,67] rely on Assumption 5.1 for the contexts, and their regret upper bounds can be disrupted by straightforward adversarial attacks, e.g., creating two identical contexts with different rewards.
The instance-dependent term reflects the "regression difficulty" of fitting all the data with a given function class, while the radius $R$ controls the richness, or complexity, of that function class. It is important to note that the choice of $R$ is flexible, although not unconstrained: the value of $R$ must exceed a polynomial in the problem parameters. When $R$ is set to a larger value, the function class $B(\theta_0, R)$ expands, which means it can potentially fit a wider range of data; consequently, the instance-dependent term tends to become smaller. Recent advances in the convergence of neural networks, as demonstrated by [2] and [18], have shown that there is an optimal region around the initialization point in over-parameterized neural networks. This suggests that, with a proper choice of $R$, the instance-dependent term can be constrained to a small constant value.
Next, we state the common assumption made in existing neural bandit works, and prove that Theorem 5.1 is no worse than their regret bounds under the same assumption. The analysis is associated with the Neural Tangent Kernel (NTK) matrix, defined as follows.

Definition 5.2 (NTK [29,57]). Let $N$ denote the normal distribution. Given the data instances $\{\mathbf{x}_t\}_{t=1}^{T}$, for all $i, j \in [T]$, the Gram entries are defined recursively through the layers following [29,57], and $\mathbf{H}$ denotes the resulting NTK matrix on $\{\mathbf{x}_t\}_{t=1}^{T}$.

Assumption 5.1. There exists $\lambda_0 > 0$ such that $\mathbf{H} \succeq \lambda_0 \mathbf{I}$.

Assumption 5.1 is generally made in the literature on neural bandits [5,6,15,30,61,66,67] to ensure the existence of a solution for NTK regression. This assumption holds when no two contexts in $\{\mathbf{x}_t\}_{t=1}^{T}$ are linearly dependent (parallel). Under this assumption, the SOTA regret upper bound for a single neural bandit ($n = 1$) [5,15,66,67] involves two complexity terms [6,67]. The first complexity term, $S$, upper-bounds the norm of the optimal parameters in the context of NTK regression. However, it is important to note that the value of $S$ becomes unbounded (i.e., $\infty$) when the matrix $\mathbf{H}$ becomes singular. This singularity can be induced by an adversary who creates two identical or parallel contexts, causing problems in their analysis.
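The failure mode just described is easy to observe on a finite-width analogue. For a two-layer ReLU net (a simplification of the $L$-layer network in the analysis), the empirical NTK Gram matrix of gradient inner products becomes singular as soon as two contexts coincide; a minimal sketch:

```python
import numpy as np

# Hedged sketch: empirical (finite-width) NTK Gram matrix
# H[i, j] = <grad f(x_i), grad f(x_j)> for f(x) = a^T relu(W x).
# Identical contexts give identical gradients, so H is rank-deficient,
# and any complexity term built on H^{-1} blows up.
def empirical_ntk(X, W, a):
    grads = []
    for x in X:
        z = W @ x
        mask = (z > 0).astype(float)
        grad_a = np.maximum(z, 0.0)        # df/da = relu(Wx)
        grad_W = np.outer(a * mask, x)     # df/dW = (a * 1{z>0}) x^T
        grads.append(np.concatenate([grad_a, grad_W.ravel()]))
    G = np.stack(grads)
    return G @ G.T
```

Feeding two copies of the same context yields a rank-1 (hence singular) $2 \times 2$ Gram matrix, which is exactly the degenerate case Theorem 5.1 tolerates but Assumption 5.1 rules out.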
The second complexity term is the effective dimension $\tilde{d}$, defined as $\tilde{d} = \frac{\log \det(\mathbf{I} + \mathbf{H}/\lambda)}{\log(1 + T/\lambda)}$, which describes the actual underlying dimension of the RKHS spanned by the NTK. The following lemma shows an upper bound on the instance-dependent term under the same assumption.

Lemma 5.3. Suppose Assumption 5.1 and the conditions in Theorem 5.1 hold, with $m \geq \Omega(\text{poly}(T, L) \cdot \lambda_0^{-1} \log(1/\delta))$. Then, with probability at least $1 - \delta$ over the initialization, there exists an admissible radius $R'$ such that the stated bound holds.

Experiments
In this section, we evaluate M-CNB's empirical performance in both online recommendation and online classification scenarios. Our source code is anonymously available at https://anonymous.4open.science/r/Mn-C35C/.
Recommendation datasets. We use four public datasets, Amazon [46], Facebook [37], MovieLens [25], and Yelp, to evaluate M-CNB's ability to discover and exploit user clusters to improve recommendation performance. Amazon is an e-commerce recommendation dataset consisting of 883,636 review ratings. Facebook is a social recommendation dataset with 88,234 links. MovieLens is a movie recommendation dataset consisting of 25 million reviews between 1.6 × 10^5 users and 6 × 10^4 movies. Yelp is a shop recommendation dataset released in the Yelp dataset challenge, composed of 4.7 million review entries made by 1.18 million users toward 1.57 × 10^5 merchants. For these four datasets, we extract ratings from the reviews and build the rating matrix by selecting the top 10,000 users and top 10,000 items (friends, movies, shops) with the most rating records. Then, we use singular-value decomposition (SVD) to extract a normalized 10-dimensional feature vector for each user and item. The goal of this problem is to select items with good ratings. Given an item and a specific user, we generate the reward from the user's rating stars for this item: if the rating is more than 4 stars (out of 5), the reward is 1; otherwise, the reward is 0. Here, we use pre-clustering (K-means) to form the user pool with 50 users (pre-clusters). Then, in each round, a user $u_t$ is randomly drawn from the user pool. For the arm pool, we randomly choose one item (restaurant, movie, etc.) rated by $u_t$ with reward 1 and randomly pick 9 other items rated by $u_t$ with reward 0. With each item corresponding to an arm, the goal of the learner is to pick the arm with the highest reward.
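The SVD feature-extraction and reward-binarization steps can be sketched as follows. The rank-$k$ truncation and row normalization follow the description above, while the exact way singular values are split into user/item factors is an assumption for illustration:

```python
import numpy as np

# Hedged sketch: rank-k truncated SVD of the rating matrix R yields
# user and item embeddings, each normalized to unit length; rewards
# are binarized at the 4-star threshold as described above.
def svd_features(R, k=10):
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    user_f = U[:, :k] * np.sqrt(s[:k])     # assumed sqrt-split of scale
    item_f = Vt[:k].T * np.sqrt(s[:k])
    normalize = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
    return normalize(user_f), normalize(item_f)

def reward(stars):
    return 1 if stars > 4 else 0           # reward 1 for more than 4 stars
```

The normalization also matches the unit-length arm assumption used in the regret analysis (Section 5).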
Classification datasets. In the online classification with bandit feedback experiments, we utilize a range of well-known classification datasets, including Mnist [36], Notmnist [11], Cifar10 [34], Emnist (Letter) [14], and Fashion [60], as well as the Shuttle, Mushroom, and MagicTelescope (MT) datasets [19]. Here, we provide some preliminaries for this setup. In round $t \in [T]$, given an instance $\mathbf{x}_t \in \mathbb{R}^d$ drawn from some distribution, we aim to classify $\mathbf{x}_t$ among $k$ classes. $\mathbf{x}_t$ is first transformed into $k$ long vectors, $\mathbf{x}_{t,1} = (\mathbf{x}_t^\top, \mathbf{0}, \dots, \mathbf{0})^\top$, $\mathbf{x}_{t,2} = (\mathbf{0}, \mathbf{x}_t^\top, \dots, \mathbf{0})^\top$, $\dots$, $\mathbf{x}_{t,k} = (\mathbf{0}, \mathbf{0}, \dots, \mathbf{x}_t^\top)^\top \in \mathbb{R}^{dk}$, matching the $k$ classes respectively. The index of the arm that the learner selects is the predicted class. Then, the reward is defined as 1 if $\mathbf{x}_t$ belongs to this class; otherwise, the reward is 0. In other words, each arm represents a specific class; for example, $\mathbf{x}_{t,1}$ corresponds to Class 1 and $\mathbf{x}_{t,2}$ to Class 2. This problem has been studied in almost all neural bandit works [6,31,66,67]. Compared to these works, we aim to learn the correlations among classes to improve performance. Thus, we treat each class as a user (bandit), i.e., a user in the recommendation scenario, and all the samples belonging to this class are deemed the data of this user. This set of experiments evaluates M-CNB's ability to learn various non-linear reward functions, as well as to discover and exploit the correlations among classes. Additionally, we extend the evaluation by combining the Mnist and Notmnist datasets to simulate a more challenging application scenario, given that both datasets involve 10-class classification problems.
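The class-to-arm transformation above amounts to block-placing the instance; a minimal sketch:

```python
import numpy as np

# Hedged sketch of the standard class-to-arm transformation: a d-dim
# instance becomes k context vectors of length k*d, the i-th placing
# x in the i-th block; the reward is 1 iff the selected arm's index
# matches the instance's true class.
def class_arms(x, k):
    d = len(x)
    arms = np.zeros((k, k * d))
    for i in range(k):
        arms[i, i * d:(i + 1) * d] = x   # i-th block holds x
    return arms
```

For example, a 2-dimensional instance with 3 classes becomes three 6-dimensional arm contexts, and picking arm index 1 earns reward 1 only when the true class is 1.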
Baselines. We compare M-CNB with SOTA baselines as follows: (1) CLUB [23] clusters users based on the connected components of the user graph and refines the groups incrementally; (2) COFIBA [39] clusters on both the user and arm sides based on evolving graphs, and chooses arms using a UCB-based exploration strategy; (3) SCLUB [40] improves CLUB by allowing groups to merge and split, enhancing the group representation; (4) LOCB [4] uses seed-based clustering, allows groups to overlap, and chooses the best group candidates for arm selection; (5) NeuUCB-ONE [67] uses one neural network to model all users and selects arms via a UCB-based recommendation; (6) NeuUCB-IND [67] uses one neural network per user ($n$ networks in total) and applies the same strategy to choose arms; (7) NeuA+U concatenates the arm features and user features together and treats them as the input of the neural network; since user features are only available on the MovieLens and Yelp datasets, we report NeuA+U's results only on these two datasets; (8) NeuralLinear, following existing work [45,65], builds a shared neural network for all users to obtain an embedding for each arm, which is fed into a linear bandit with the clustering procedure. Since LinUCB [38] and KernelUCB [54] are outperformed by the above baselines, we do not include them in the comparison.
Configurations. We run all experiments on a server with an NVIDIA Tesla V100 SXM2 GPU. All baselines have two parameters: a regularization parameter $\lambda$ used at initialization and an exploration parameter $\alpha$ that adjusts the UCB value. To find their best performance, we conduct a grid search for $\lambda$ and $\alpha$ over (0.01, 0.1, 1) and (0.0001, 0.001, 0.01, 0.1), respectively. For LOCB, the number of random seeds is set to 20, following its default setting. For M-CNB, we set $\gamma = 5$ and $\nu = 0.4$ to tune the clusters, and $\lambda$ is set to 1. To ensure a fair comparison, all neural methods use the same simple neural network with 2 fully-connected layers, with the width $m$ set to 100. To save running time, we train the neural networks every 10 rounds in the first 1000 rounds and every 100 rounds afterwards. In our implementation, we use Adam [32] for SGD. In the end, we choose the best results for the comparison and report the mean and standard deviation (shadows in the figures) over 10 runs for all methods.
Results. Figures 2-4 report the average regret of all methods on the recommendation and classification datasets. Figure 2 displays the regret curves for all methods evaluated on the MovieLens and Yelp datasets. In these experiments, M-CNB consistently outperforms all baseline methods, showcasing its effectiveness. Specifically, M-CNB improves performance by 5.8% on Amazon, 7.7% on Facebook, 8.1% on MovieLens, and 2.0% on Yelp, compared to the best-performing baseline. These superior results can be attributed to two specific advantages that M-CNB offers over the two types of baseline methods. In contrast to conventional linear clustering of bandits (CLUB, COFIBA, SCLUB, LOCB), M-CNB can learn non-linear reward functions. This flexibility allows M-CNB to excel in scenarios where user preferences are non-linear in the arm contexts. In comparison to neural bandits (NeuUCB-ONE, NeuUCB-IND, NeuA+U, NeuralLinear), M-CNB takes advantage of user clustering and leverages the correlations within these clusters, as captured by the meta-learner. This exploitation of inter-user correlations enables M-CNB to enhance recommendation performance. By combining these advantages, M-CNB achieves substantial improvements on the MovieLens and Yelp datasets, demonstrating its strength in addressing collaborative neural bandit problems and enhancing recommender systems. Note that M-CNB's per-round regret still decreases on all four datasets, despite the seemingly linear shape of the curves in Figure 2.
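The "average regret" curves reported in the figures can be computed as cumulative regret divided by the number of rounds so far; the following sketch (an assumption about the standard bookkeeping, not code from the paper) makes this explicit.

```python
def average_regret_curve(rewards_received, rewards_optimal):
    """Average (per-round) regret after each round t.

    Per-round regret is the gap between the optimal arm's reward and the
    received reward; the plotted curve is the running cumulative regret
    divided by the round index.
    """
    curve, cum = [], 0.0
    for t, (r, r_star) in enumerate(zip(rewards_received, rewards_optimal), 1):
        cum += r_star - r
        curve.append(cum / t)
    return curve
```

A decreasing curve of this kind is what distinguishes sublinear regret from the "linear-like" raw cumulative-regret plots mentioned above.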
Figures 3 and 4 show the regret comparison on the ML datasets, where M-CNB outperforms all baselines. Here, each class can be thought of as a user. The ML datasets exhibit non-linear reward functions with respect to the arms, making them challenging for conventional clustering of linear bandits (CLUB, COFIBA, SCLUB, LOCB); these methods struggle to capture the non-linearity of the reward functions, resulting in sub-optimal performance. Among the neural baselines, NeuUCB-ONE benefits from the representation power of neural networks, but it treats all users (classes) as a single cluster, overlooking the variations and correlations among them. NeuUCB-IND, on the other hand, handles users individually, neglecting the potential benefits of collaborative knowledge among users. NeuralLinear uses one shared embedding (neural network) for all users, which may not be optimal given user heterogeneity. M-CNB's advantage lies in its ability to exploit shared knowledge within clusters of strongly correlated classes: it leverages this common knowledge to improve performance across different tasks, as it can efficiently adapt its meta-learner based on past clusters. Running time analysis. Figure 5 demonstrates the trade-off between running time and cumulative regret on the MovieLens and MNIST datasets, where the x-axis is in seconds. As M-CNB falls under the neural bandit framework, we use NeuUCB-ONE as the baseline (1.0). The results indicate that M-CNB incurs a comparable computational cost (1.6× on MovieLens and 2.9× on MNIST) to NeuUCB-ONE while substantially improving performance. This suggests that M-CNB can be deployed to significantly enhance performance when user correlation is a crucial factor (e.g., recommendation tasks), with only a moderate increase in computational overhead. Now, let us delve into the analysis of the running time for M-CNB. Specifically, we can break down the computational cost of
M-CNB into three main components: (1) Clustering: forming the user cluster (Line 7 in Algorithm 1); (2) Meta adaptation: training the meta-model (Algorithm 2); (3) User-learner training: training the user-learners (Lines 14-18 in Algorithm 1). Table 1 provides the breakdown of the time cost for these three components. Clustering: this part's time cost grows linearly with the number of users, since the clustering step has time complexity linear in the number of users. As discussed previously, leveraging pre-clustering techniques can significantly reduce this cost. It is also important to note that all clustering methods inherently incur this cost, and it is challenging to reduce it further. Meta adaptation: thanks to meta-learning, this part requires only a few gradient-descent steps to train a well-performing model, so its time cost is relatively trivial. User-learner training: while this part may require more SGD steps to converge, it is primarily used for clustering purposes, so the frequency of training the user-learners can be reduced to decrease the cost. In summary, M-CNB achieves clustering of neural bandits while striking a good balance between computational cost and model performance. Study of the two clustering parameters.
Figure 6 illustrates how M-CNB's performance varies with the two clustering parameters. For the sake of discussion, we focus on the first; the second plays a similar role in controlling clustering. When this parameter is set to a value like 1.1, the exploration range of clusters becomes very narrow. In this case, the inferred cluster size in each round tends to be small, which means the inferred cluster is more likely to consist of true members of the serving user's relative cluster. However, such a narrow exploration range has a drawback: it might miss potential cluster members in the initial phases of learning. On the other hand, setting the parameter to a larger value, such as 5, widens the exploration range of clusters, giving more opportunities to include a larger number of members in the inferred cluster. However, continuously increasing the parameter does not necessarily improve performance, because excessively large values might produce inferred clusters that include non-collaborative users and clustering noise. Therefore, in practice, we recommend setting this parameter to a relatively large value (e.g., 5) that strikes a balance between exploration and exploitation. Study of the remaining parameter. Figure 7 provides insight into M-CNB's sensitivity to the remaining parameter in Algorithm 1. It is evident that M-CNB exhibits robust performance across a range of values for this parameter. This robustness can be attributed to the strong discriminability of the meta-learner and the derived upper bound. Even with varying values, the relative order of arms ranked by M-CNB experiences only slight changes. This consistency in arm rankings demonstrates that M-CNB maintains robust performance, which in turn reduces the need for extensive hyperparameter tuning.
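The narrow-vs-wide trade-off above can be illustrated with a minimal threshold-based cluster-inference sketch. This is a hedged reconstruction, not the paper's exact rule: the names `gamma` (the clustering parameter under discussion) and `widths` (per-user confidence widths) are stand-ins for symbols garbled in this copy.

```python
def inferred_cluster(serving_user, estimates, widths, gamma):
    """Sketch of one round of threshold-based cluster inference.

    Include user j in the serving user's inferred cluster when the gap
    between their reward estimates is within gamma times the sum of
    their confidence widths. Larger gamma widens the exploration range.
    """
    i = serving_user
    return [
        j for j in estimates
        if abs(estimates[i] - estimates[j]) <= gamma * (widths[i] + widths[j])
    ]
```

With a narrow setting (e.g., 1.1) only near-identical users are grouped; with a wider setting (e.g., 5) more candidates are admitted, at the risk of pulling in non-collaborative users, matching the behavior described for Figure 6.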

Conclusion
In this paper, we study the Clustering of Neural Bandits problem to incorporate user correlation into bandits with generic reward assumptions. We then propose a novel algorithm, M-CNB, to solve this problem, in which a meta-learner represents and rapidly adapts to dynamic clusters, combined with an informative UCB-type exploration strategy. Moreover, we provide an instance-dependent regret analysis for M-CNB. Finally, to demonstrate the effectiveness of M-CNB, we conduct extensive experiments evaluating its empirical performance against strong baselines on recommendation and classification datasets.

A Proof Details of Theorem 5.1
Our proof technique differs from related works: [4, 23, 24, 39, 40] are built on the classic linear bandit framework, and [31, 66, 67] use kernel-based analysis in the NTK regime. In contrast, we use the generalization bound of the user-learner to bound the error incurred in each round, and we bridge the meta-learner with the user-learner by bounding their distance, which leads to our final regret bound. Specifically, we decompose the regret of T rounds into three key terms (Eq. (27)): the first term is the error induced by the user-learner, the second term is the distance between the user-learner and the meta-learner, and the third term is the error induced by the meta-learner Θ. Lemma A.10 provides an upper bound for the first term; it is an extension of Lemma A.7, which is the key to removing the input dimension. Lemma A.7 has two terms with complexity O(√T): the first is the training error induced by a class of functions around initialization, and the second is the deviation induced by a concentration inequality for the user-learner. Lemma A.13 bounds the distance between the user-learner and the meta-learner. Lemma A.14 bounds the error induced by the meta-learner via the triangle inequality, bridged by the user-learner. Bounding the three terms in Eq. (27) completes the proof.
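The three-term decomposition described above can be sketched as follows; this is a hedged reconstruction of the shape of Eq. (27), with placeholder symbols (the user-learner f(·; θ), the meta-learner f(·; Θ), and the true reward h), since the original display equations are garbled in this copy.

```latex
% Sketch of the three-term regret decomposition behind Eq. (27);
% symbols are placeholders for the garbled originals.
\begin{align*}
R_T \;\lesssim\;
  \underbrace{\sum_{t=1}^{T} \bigl| f(\mathbf{x}_t;\theta_{u_t}) - h(\mathbf{x}_t) \bigr|}_{\text{user-learner error}}
\;+\;
  \underbrace{\sum_{t=1}^{T} \bigl| f(\mathbf{x}_t;\Theta_t) - f(\mathbf{x}_t;\theta_{u_t}) \bigr|}_{\text{meta--user distance}}
\;+\;
  \underbrace{\sum_{t=1}^{T} \bigl| f(\mathbf{x}_t;\Theta_t) - h(\mathbf{x}_t) \bigr|}_{\text{meta-learner error}}
\end{align*}
```

Lemma A.10 then controls the first sum, Lemma A.13 the second, and Lemma A.14 the third.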
We first show the lemmas for the analysis of user-learner in Section A.1, the lemmas for meta-learner in Section A.2, the lemma to bridge bandit-learner and meta-learner in Section A.3, and the lemmas for the main workflow in Section A.4.
where (a) is due to the convexity of the loss, (b) is an application of the triangle inequality, and (c) is an application of Lemma A.2. With the stated choice of step size, the proof is completed. □ Lemma A.5 (User Trajectory Ball). Suppose the width and step sizes satisfy the conditions in Theorem 5.1. With high probability over the randomness of initialization, it holds uniformly that every iterate of the user-learner stays within the stated ball around the initialization. Proof. The proof follows a simple induction. Obviously, the initialization lies in the ball. Supposing the first iterates lie in the ball, the next iterate is shown to lie in it as well. Then, by the choice of the radius, the proof is complete. □ Lemma A.6 (Instance-dependent Loss Bound). Define the squared loss of the user-learner on the selected arm. Suppose the width and step sizes satisfy the conditions in Theorem 5.1. With high probability over the randomness of initialization, the cumulative training loss is bounded by the stated instance-dependent quantity. Proof. In each round, based on Lemma A.5, every iterate satisfies the distance condition of Lemma A.4; then, based on Lemma A.4, the stated bound holds uniformly for all reference points in the ball. Therefore, the per-step decrease follows, where (a) is by the definition of gradient descent and (b) is due to the standard inner-product identity. Summing over T rounds, (a) follows by simply discarding the last term and (b) by the choice of step size. The proof is completed. □ Lemma A.7. For any δ ∈ (0, 1), suppose the width and step sizes satisfy the conditions in Theorem 5.1. In a round in which a given user is served, let the selected arm and corresponding received reward be given. Then, with probability at least 1 − δ over the randomness of initialization, the cumulative regret induced by that user is upper bounded by the stated quantity. Proof. According to Lemma A.1, for any unit-norm context, the stated bound holds in every round in which the user is served. Then, in each round, we define the conditional error term, where the expectation is taken over the reward conditioned on the context, with respect to the σ-algebra generated by the user's history.
Because the resulting sequence is a martingale difference sequence, applying the Hoeffding-Azuma inequality yields the concentration term. For the remaining term, based on Lemma A.5 and Lemma A.6, the bound follows, where (a) is an application of Lemma A.6. Combining Eq. (11) and Eq. (12) and applying the union bound completes the proof. □ Lemma A.8. For any δ ∈ (0, 1), suppose the width and step sizes satisfy the conditions in Theorem 5.1, and suppose the inferred cluster equals the true cluster in every round. After T rounds, with probability at least 1 − δ over the random initialization, the cumulative error induced by the bandit-learners is upper bounded by the stated quantity. Proof. Applying Lemma A.7 over all users gives the result, where the first inequality applies the Hoeffding-Azuma inequality to the cluster sequence and the last inequality is based on Lemma A.
For the second term, we have the stated bound, where the inequalities use the triangle inequality. For the third term, we expand it into sub-terms. For the first sub-term, we have the stated bound, where (a) is an application of Lemma A.5 and (b) utilizes Lemma A.12 together with Lemma A.5.
For the second sub-term, we have the stated bound, where (a) uses Lemmas A.1, A.5, and A.11. For the fourth sub-term, we have the stated bound, where (a) follows from the Cauchy-Schwarz inequality and the last inequality is by Lemma A.11.
For the first term, applying Lemma A.13, with probability at least 1 − δ, for any unit-norm context, we have the stated bound, where we ignore the last term as a result of the stated parameter choice. For the second term, based on Lemma A.7, with probability at least 1 − δ, we have the stated bound. The proof is complete. □

For (2), by standard results for gradient descent on ridge regression, the iterates converge to the optimum, which equals the initialization plus the ridge solution. Therefore, we have the stated bound. Reducing the naive regret upper bound: O(√T) is roughly the regret effort to learn a single neural bandit, so naively learning one neural bandit per user multiplies this effort by the number of users. We reduce this linear dependence on the number of users to a square-root dependence.
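The "standard result" for gradient descent on ridge regression invoked above can be checked numerically. The following pure-Python sketch (an illustration, not the paper's procedure) minimizes the ridge objective from a zero initialization and converges to the closed-form optimum (XᵀX + λI)⁻¹Xᵀy.

```python
def ridge_gd(X, y, lam, lr, steps):
    """Gradient descent on 0.5*||X w - y||^2 + 0.5*lam*||w||^2 from w = 0.

    X: list of rows (lists of floats); y: list of targets.
    Returns the final iterate, which approaches the ridge solution.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Residuals r = X w - y.
        r = [sum(Xi[k] * w[k] for k in range(d)) - yi for Xi, yi in zip(X, y)]
        # Gradient = X^T r + lam * w.
        grad = [sum(X[n][k] * r[n] for n in range(len(X))) + lam * w[k]
                for k in range(d)]
        w = [w[k] - lr * grad[k] for k in range(d)]
    return w
```

For a one-dimensional design with XᵀX = 5 and Xᵀy = 5, the iterates contract geometrically toward 5/(5 + λ), the ridge optimum, mirroring the convergence claim used in the proof.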

Figure 5: Running time vs. performance for all methods.

Figure 6: Sensitivity study for the two clustering parameters on the MovieLens dataset.

Table 1: Breakdown of the per-round time cost of M-CNB (seconds) with different numbers of users on MovieLens.