Non-Compliant Bandits

Bandit algorithms arose as a standard approach to learning better models online. As they become more popular, they are increasingly deployed in complex machine learning pipelines, where their actions can be overwritten. For example, in ranking problems, a list of recommended items can be modified by a downstream algorithm to increase diversity. This may break classic bandit algorithms and lead to linear regret. Specifically, if the proposed action is not taken, the uncertainty in its estimated mean reward may not get reduced. In this work, we study this setting and call it non-compliant bandits, since the agent tries to learn rewarding actions that comply with a downstream task. We propose two algorithms, compliant contextual UCB (CompUCB) and compliant contextual Thompson sampling (CompTS), which learn separate reward and compliance models. The compliance model allows the agent to avoid non-compliant actions. We derive a sublinear regret bound for CompUCB. We also conduct experiments that compare our algorithms to classic bandit baselines. The experiments show failures of the baselines, and that learning compliance models mitigates them.


INTRODUCTION
A bandit [5,23,27] is a framework for sequential decision making under uncertainty where the learning agent takes actions, potentially based on context, and observes rewards. The typical goal of the agent is to learn an optimal policy that maximizes the expected cumulative reward, which equivalently minimizes the expected cumulative regret. Many types of bandit algorithms exist: contextual [1,4,10], combinatorial [7,15,20,21,34], and even for multiple objectives [31]. The underlying principle in all of them is that the uncertainty in the mean reward estimate of the proposed action is reduced after the action is taken. When the taken action differs from the proposed action, the algorithms can have linear regret. In this work, we study this setting under the name of non-compliant bandits.
An action is non-compliant if it is overwritten or censored later. Non-compliance is common in practice. For instance, an upstream model may recommend a movie that a downstream model or rule-based system overwrites. More specifically, the downstream system may use a propensity model to predict the probability of violence and overwrite violent recommendations for young viewers. When the recommendation of the upstream model does not comply, the upstream model may or may not know which movie was actually recommended [6,36]. Recommendations of an upstream model may also be censored. For instance, a voice assistant may suppress recommendations that could disturb the user. Since bandit algorithms are a popular approach to learning to recommend [19,20,28,34,41], where action overwriting and censorship are common, we set out to study non-compliance in the bandit setting.
Only three prior works studied a similar setting [17,33,38]. In the first two, the agent observes the taken action, as in our work. Stirn and Jebara [38] do not make this assumption. However, they model reward and compliance separately, as we do. While the first two works are non-contextual, the last one considers a trivial tabular case, with a separate Bernoulli bandit per context. In summary, non-compliant bandit algorithms that can handle realistic context, and thus be practical, do not exist. Given the importance of integrating bandit algorithms with modern machine learning pipelines, the problem of non-compliant bandits is significantly understudied, and we attempt to fill this gap.
In this work, we propose two non-compliant bandit algorithms: compliant contextual Thompson sampling (CompTS) and compliant contextual upper confidence bound (CompUCB). The algorithms act optimistically with respect to the product of the compliance probability and mean reward. Both the reward and compliance models are contextual and learned separately. We focus on linear reward and logistic compliance models. This leads to simple practical algorithms that are general, due to using features, and both computationally and statistically efficient, due to relying on linear and generalized linear models. We prove a $\tilde{O}(d\sqrt{n})$ regret bound for CompUCB with $d$ features over $n$ rounds. We experiment with both synthetic and real-world problems, where CompUCB and CompTS are compared to classic bandit baselines and a neural network bandit algorithm. Our experiments show that CompUCB and CompTS have much lower regret than their classic counterparts, and are more efficient than learning a neural network reward model. This paper is organized as follows. In Section 2, we introduce the setting of non-compliant bandits. In Section 3, we propose CompTS and CompUCB for non-compliant bandits. In Section 4, we prove a $\tilde{O}(d\sqrt{n})$ regret bound for CompUCB. In Section 5, we evaluate CompTS and CompUCB on both synthetic and real-world problems. We conclude in Section 6.

SETTING
We adopt the following notation. Random variables are capitalized, except for Greek letters. For any positive integer $n$, we define $[n] = \{1, \ldots, n\}$. The indicator function is $\mathbb{1}\{\cdot\}$. The $i$-th entry of a vector $v$ is $v_i$. If the vector is already indexed, such as $v_t$, we write $v_{t,i}$. We denote the maximum and minimum eigenvalues of a matrix $M \in \mathbb{R}^{d \times d}$ by $\lambda_1(M)$ and $\lambda_d(M)$, respectively.
A contextual bandit [1,4,25,28] is a sequential decision-making problem where the mean rewards of actions depend on context. We model it as follows. A learning agent interacts with the environment for $n$ rounds. In round $t \in [n]$, it takes an action $A_t \in \mathcal{A}_t$ and then observes its stochastic reward $Y_t \in \mathbb{R}$, where $\mathcal{A}_t \subseteq \mathcal{A}$ is a round-dependent action set and $\mathcal{A}$ is the set of all actions. We assume that $\mathcal{A} \subseteq \mathbb{R}^d$ is a set of $d$-dimensional feature vectors. Since $\mathcal{A}_t$ depends on $t$, we can encode any round-dependent context in it, such as all feasible movies to recommend in round $t$. The mean reward of action $a \in \mathcal{A}$ is $r(a; \theta_*)$, where $r : \mathcal{A} \times \Theta \to \mathbb{R}$ is the reward function, $\theta_* \in \Theta$ is an unknown reward parameter, and $\Theta$ is the set of feasible reward parameters. The stochastic reward of action $A_t$ in round $t$ is $Y_t = r(A_t; \theta_*) + \varepsilon_t$, where $\varepsilon_t$ is independent $\sigma^2$-sub-Gaussian noise.

Many Flavors of Non-Compliance
Our bandit model can be viewed as a contextual bandit where the taken action may differ from the proposed action $A_t$. We denote the taken action by $\tilde{A}_t$ and relate the two actions as $\tilde{A}_t = h(A_t)$, for some function $h$. The function $h$ may be round-dependent and stochastic, and is unknown to the agent. The reward of action $\tilde{A}_t$ in round $t$ is $\tilde{Y}_t = r(\tilde{A}_t; \theta_*) + \tilde{\varepsilon}_t$, where $\tilde{\varepsilon}_t$ is independent $\sigma^2$-sub-Gaussian noise.
In the rest of this section, we discuss two variants of the problem, where $\tilde{A}_t$ is either observed or not. In both variants, $\tilde{Y}_t$ is observed, since learning without any feedback would be impossible.

Single model. Suppose that the taken action $\tilde{A}_t$ is unobserved or we choose not to model it. Then our problem can still be solved as a contextual bandit where the mean reward of action $a$ is $r(h(a); \theta_*)$, and it is learned from pairs $(A_t, \tilde{Y}_t)$. We call this approach a single model. The challenge is that $r(h(a); \theta_*)$ may be a complex function of action $a$, representing how the downstream model replaces $a$ with another action. These downstream models are often complex, and thus arguably hard to mimic and learn.
Classic model. Now suppose that the taken action $\tilde{A}_t$ is observed. Then any contextual bandit algorithm can be used to learn $r(a; \theta_*)$ from pairs $(\tilde{A}_t, \tilde{Y}_t)$. We call this approach classic. It is not immediately obvious that this approach can fail. The reason it does is that when an action $A_t$ is proposed but the model is updated with $(\tilde{A}_t, \tilde{Y}_t)$, the uncertainty in the mean reward estimate of action $A_t$ may not be reduced. Therefore, a classic bandit algorithm may repeatedly propose the same action whose uncertainty is never reduced and get stuck.
To address the shortcomings of existing bandit algorithms, we propose modeling compliance. At a high level, we learn a function $c(a) = P(h(a) = a)$, which represents the probability that action $a$ complies. This is possible when both the proposed action $A_t$ and the taken action $\tilde{A}_t$ are observed. Modeling compliance prevents the agent from repeatedly proposing actions whose uncertainty cannot be reduced, which is what leads to failures of the classic bandit algorithms.

Non-Compliant Bandits
A non-compliant bandit is a variant of a contextual bandit defined as follows. In round $t \in [n]$, the agent proposes an action $A_t \in \mathcal{A}_t$ and gets two observations. The first observation is the taken action $\tilde{A}_t \in \mathcal{A}_t$. This action may differ from the proposed action $A_t$. The second observation is the stochastic reward of the taken action $\tilde{A}_t$, $\tilde{Y}_t = r(\tilde{A}_t; \theta_*) + \tilde{\varepsilon}_t$, where $\tilde{\varepsilon}_t$ is independent $\sigma^2$-sub-Gaussian noise.
We discuss other feedback models in Section 2.1.
We consider a probabilistic compliance model, which models the probability that an action complies. Specifically, the probability that action $a \in \mathcal{A}$ complies is $p(a; \psi_*)$, where $p : \mathcal{A} \times \Psi \to [0, 1]$ is the compliance function, $\psi_* \in \Psi$ is an unknown compliance parameter, and $\Psi$ is the set of feasible compliance parameters. Since the agent knows both $A_t$ and $\tilde{A}_t$, it effectively observes a stochastic compliance indicator $Z_t = \mathbb{1}\{A_t = \tilde{A}_t\}$, which is 1 if the proposed and taken actions are the same. This feedback relates to the compliance probability as $Z_t = p(A_t; \psi_*) + \eta_t$, where $\eta_t$ is independent Bernoulli noise.

Now we discuss the notion of optimality. An important point to realize is that the classic notion of regret cannot be minimized. As an example, consider a bandit problem where the optimal action is always overwritten by the second best action. Therefore, in this work, we define the optimal action in round $t$ as the one with the highest mean reward weighted by its compliance probability,
$$A_{t,*} = \arg\max_{a \in \mathcal{A}_t} p(a; \psi_*)\, r(a; \theta_*)\,.$$
In plain English, this is the highest mean reward that the agent can attain without modeling non-compliance in detail, when the reward of a non-compliant round ($Z_t = 0$) is counted as zero. Specifically, suppose that $r(a; \theta_*) \geq 0$ for all actions $a \in \mathcal{A}$, and that the reward and compliance noise are independent. Then $\mathbb{E}[Z_t \tilde{Y}_t \mid A_t = a] = p(a; \psi_*)\, r(a; \theta_*)$. As discussed earlier, the downstream algorithm is usually complex, and thus hard to mimic and learn. So $p(a; \psi_*)\, r(a; \theta_*)$ is a reasonable measure of attainable reward. The corresponding expected $n$-round regret is
$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} p(A_{t,*}; \psi_*)\, r(A_{t,*}; \theta_*) - p(A_t; \psi_*)\, r(A_t; \theta_*)\right]\,.$$
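To make the protocol concrete, the following Python sketch simulates one round of a non-compliant bandit under the linear reward and logistic compliance models used later in the paper. The function names (step_round, optimal_index) and the specific overwriting rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step_round(actions, proposed_idx, theta_star, psi_star, rng, sigma=0.5):
    """One round of a non-compliant bandit (illustrative sketch).

    actions: (K, d) feature vectors of the round's action set A_t.
    Returns the taken action index, the compliance indicator Z_t, and the reward.
    """
    a = actions[proposed_idx]
    complies = rng.random() < sigmoid(a @ psi_star)  # Z_t ~ Ber(p(A_t; psi_*))
    if complies:
        taken_idx = proposed_idx
    else:
        # The downstream system overwrites the action; here, with a random compliant one.
        other = rng.random(len(actions)) < sigmoid(actions @ psi_star)
        candidates = np.flatnonzero(other)
        taken_idx = int(rng.choice(candidates)) if len(candidates) else int(rng.integers(len(actions)))
    reward = actions[taken_idx] @ theta_star + rng.normal(0.0, sigma)  # reward of the taken action
    return taken_idx, int(complies), reward

def optimal_index(actions, theta_star, psi_star):
    """Benchmark action A_{t,*}: the highest compliance-weighted mean reward."""
    return int(np.argmax(sigmoid(actions @ psi_star) * (actions @ theta_star)))
```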

ALGORITHMS
The key idea in our algorithms is to act optimistically with respect to the product of the compliance probability and mean reward. One natural property of this design is that the agent ultimately learns to take compliant actions, because the optimistic overestimates of the compliance probability of non-compliant actions eventually go to zero. When this happens, the agent starts taking compliant actions. While these may have lower mean rewards, they can be better in expectation once compliance is taken into account.

Algorithm Designs
We consider two algorithm designs. The first algorithm is a form of contextual Thompson sampling (TS) [2-4, 22, 35, 39]. We present it in Algorithm 1 and call it CompTS, which stands for compliant contextual TS. The key idea in CompTS is to sample the unknown model parameters from their posterior distributions and then act optimistically with respect to them. The posterior distribution of $\theta_*$ in round $t$ is denoted by $q_{r,t}$, and $\tilde{\theta}_t$ is sampled from it. The posterior distribution of $\psi_*$ in round $t$ is denoted by $q_{p,t}$, and $\tilde{\psi}_t$ is sampled from it. The posteriors are detailed in Sections 3.2 and 3.3. Since our setting is frequentist (Section 2), posterior sampling only serves as a randomization oracle that is optimistic with a sufficiently high probability [4].
The second algorithm relies on upper confidence bounds (UCBs) [1,10,14,29]. We present it in Algorithm 2 and call it CompUCB, which stands for compliant contextual UCB. The key idea is to act optimistically with respect to the product of estimated compliance probabilities and mean rewards. The UCB on the mean reward of action $a$ in round $t$ is $U_{r,t}(a)$. The UCB on the compliance probability of action $a$ in round $t$ is $U_{p,t}(a)$. Note that when the mean rewards are non-negative, $U_{r,t}(a)\, U_{p,t}(a)$ is a valid UCB on $p(a; \psi_*)\, r(a; \theta_*)$. The UCBs are detailed in Sections 3.2 and 3.3.
To have computationally-efficient implementations of CompTS and CompUCB, we consider specific reward and compliance models. In particular, the reward function is linear, $r(a; \theta) = a^\top \theta$ for any $a \in \mathcal{A}$; and the compliance function is logistic, $p(a; \psi) = \mu(a^\top \psi)$ for any $a \in \mathcal{A}$, where $\mu(v) = 1 / (1 + \exp[-v])$ is a sigmoid. We make these choices for two reasons. First, both models are very general and flexible, because any function can be approximated by a linear function of non-linear features, which can be engineered based on domain knowledge or previously logged data. Second, exploration with linear and logistic models is well understood. Therefore, we can build on existing techniques, both in the algorithm design and analysis [1,2,10,14,22,29].
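Under these model choices, the per-round action selection in both algorithms reduces to a few matrix-vector operations. The sketch below is a minimal illustration: the estimates, inverse Gram matrices, and confidence widths (alpha_r, alpha_p) are placeholders for the quantities defined in Sections 3.2 and 3.3, so this is not the exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def comp_ucb_choice(actions, theta_hat, G_r_inv, alpha_r, psi_hat, G_p_inv, alpha_p):
    """CompUCB-style choice: maximize the product of reward and compliance UCBs."""
    width_r = np.sqrt(np.einsum("ij,jk,ik->i", actions, G_r_inv, actions))
    width_p = np.sqrt(np.einsum("ij,jk,ik->i", actions, G_p_inv, actions))
    ucb_r = np.clip(actions @ theta_hat + alpha_r * width_r, 0.0, 1.0)         # reward UCB, clipped to [0, 1]
    ucb_p = np.clip(sigmoid(actions @ psi_hat) + alpha_p * width_p, 0.0, 1.0)  # compliance UCB
    return int(np.argmax(ucb_r * ucb_p))

def comp_ts_choice(actions, theta_hat, cov_theta, psi_hat, cov_psi, rng):
    """CompTS-style choice: sample both parameters, then act greedily on their product."""
    theta_tilde = rng.multivariate_normal(theta_hat, cov_theta)
    psi_tilde = rng.multivariate_normal(psi_hat, cov_psi)
    return int(np.argmax(sigmoid(actions @ psi_tilde) * (actions @ theta_tilde)))
```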

Reward Model Estimation
Since the reward of the proposed action $A_t$ may not be observed when the action is non-compliant, we estimate the reward model from taken actions $\tilde{A}_t$ and their observations $\tilde{Y}_t$. Specifically, the reward parameter in round $t$ is estimated using regularized least squares as
$$\hat{\theta}_t = G_{r,t}^{-1} \sum_{\ell=1}^{t-1} \tilde{A}_\ell \tilde{Y}_\ell\,, \qquad G_{r,t} = \lambda I_d + \sum_{\ell=1}^{t-1} \tilde{A}_\ell \tilde{A}_\ell^\top\,,$$
where $G_{r,t}$ is the Gram matrix for parameter $\theta_*$ in round $t$ and $\lambda > 0$ is a regularization parameter. This design is motivated by LinUCB [1]. Note that both $\hat{\theta}_t$ and $G_{r,t}$ can be updated online in $O(d^2)$ time using the Sherman-Morrison formula.
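A minimal sketch of this online update is below, assuming the regularized least-squares estimator above; the class and variable names are illustrative.

```python
import numpy as np

class LinearRewardModel:
    """Ridge-regression reward model updated from (taken action, reward) pairs."""

    def __init__(self, d, reg=1.0):
        self.G_inv = np.eye(d) / reg   # (lambda I_d + sum of a a^T)^{-1}
        self.b = np.zeros(d)           # sum of a * y
        self.theta_hat = np.zeros(d)

    def update(self, a_taken, y):
        # Sherman-Morrison rank-one update of the inverse Gram matrix: O(d^2) per round.
        Ga = self.G_inv @ a_taken
        self.G_inv -= np.outer(Ga, Ga) / (1.0 + a_taken @ Ga)
        self.b += y * a_taken
        self.theta_hat = self.G_inv @ self.b
```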

Compliance Model Estimation
The compliance model is logistic, $p(a; \psi_*) = \mu(a^\top \psi_*)$, where $\mu$ is a sigmoid. We estimate it from proposed actions $A_t$ and their compliance indicators $Z_t = \mathbb{1}\{A_t = \tilde{A}_t\}$. More specifically, the compliance parameter in round $t$ is estimated using logistic regression as
$$\hat{\psi}_t = \arg\max_{\psi \in \Psi} \sum_{\ell=1}^{t-1} Z_\ell \log \mu(A_\ell^\top \psi) + (1 - Z_\ell) \log(1 - \mu(A_\ell^\top \psi))\,. \qquad (3)$$
We solve this problem by iteratively reweighted least squares (IRLS) [40]. This is an iterative algorithm and each of its iterations takes $O(t)$ time, since (3) contains $O(t)$ terms. To speed up the computation, we initialize IRLS in round $t$ with the solution from round $t - 1$.
After that, IRLS typically converges in a single step. This speedup is simple and sufficient for our needs. Other works on generalized linear bandits could be used to further increase computational efficiency. For instance, Jun et al. [16] apply the online Newton step and Ding et al. [12] apply online stochastic gradient descent. Uncertainty in the estimated compliance parameter $\hat{\psi}_t$ is modeled using the Gram matrix $G_{p,t} = \sum_{\ell=1}^{t-1} A_\ell A_\ell^\top$, which can be updated online in $O(d^2)$ time. An upper confidence bound that holds jointly over all rounds with probability $1 - \delta$ is
$$U_{p,t}(a) = \mu(a^\top \hat{\psi}_t) + \alpha_t \sqrt{a^\top \Sigma_{p,t}\, a}\,,$$
where $\Sigma_{p,t} = G_{p,t}^{-1}$ and $\alpha_t = O(\sqrt{d \log(1/\delta)})$. This design is based on Li et al. [29], who compute an upper confidence bound on $a^\top \psi_*$ instead of $\mu(a^\top \psi_*)$. We state $\alpha_t$ exactly in the proof of Theorem 1.
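A sketch of the warm-started IRLS (Newton) update for the compliance parameter is below. The small ridge term and the stopping rule are illustrative choices added for numerical stability; they are not part of objective (3).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def irls_logistic(A, z, psi_init, ridge=1e-6, max_iter=5, tol=1e-6):
    """Fit the compliance parameter by IRLS, warm-started at the previous round's solution.

    A: (t-1, d) proposed-action features; z: (t-1,) compliance indicators in {0, 1}.
    With a warm start, a single Newton step is typically sufficient.
    """
    psi = psi_init.copy()
    for _ in range(max_iter):
        p = sigmoid(A @ psi)
        w = p * (1.0 - p)                                   # per-example weights mu'(a^T psi)
        grad = A.T @ (p - z) + ridge * psi
        hess = (A * w[:, None]).T @ A + ridge * np.eye(A.shape[1])
        step = np.linalg.solve(hess, grad)
        psi -= step
        if np.linalg.norm(step) < tol:
            break
    return psi
```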
The posterior distribution $q_{p,t}$ is a multivariate Gaussian centered at the estimated compliance parameter $\hat{\psi}_t$ with covariance proportional to $\Sigma_{p,t}$. Finally, note that (3) and $\Sigma_{p,t}$ are ill defined unless $\lambda_d(G_{p,t}) > 0$. To guarantee that this condition holds, we initially explore randomly in CompUCB and CompTS until $\lambda_d(G_{p,t}) \geq 1$. Our analysis is also under the assumption that $\lambda_d(G_{p,t}) \geq 1$.
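The initial forced exploration can be implemented as a simple eigenvalue check, as in the sketch below; the helper name is illustrative.

```python
import numpy as np

def choose_action_index(actions, G_p, optimistic_scores, rng):
    """Explore randomly until lambda_min(G_{p,t}) >= 1, then act optimistically."""
    if np.linalg.eigvalsh(G_p)[0] < 1.0:        # smallest eigenvalue of the compliance Gram matrix
        return int(rng.integers(len(actions)))  # random exploration
    return int(np.argmax(optimistic_scores))
```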

ANALYSIS
In this section, we analyze CompUCB and make the following assumptions. First, the feature vectors $a \in \mathcal{A}$ are bounded as $\|a\|_2 \leq 1$. This assumption is standard and without loss of generality. Second, we assume that $r(a; \theta_*) \in [0, 1]$ for all $a \in \mathcal{A}$. This can be satisfied by rescaling the original function and thus is without loss of generality. Under this assumption, $p(a; \psi_*)\, r(a; \theta_*) \in [0, 1]$, which we use in the proof. We also assume that $U_{r,t}(a) \leq 1$, which holds when $U_{r,t}(a)$ is clipped at 1. We have $U_{p,t}(a) \leq 1$ by design.
Our regret bound is stated and discussed below. It depends on the minimum and maximum derivatives of the logistic function in the compliance model, which is standard in generalized linear bandit analyses [22,29]. The minimum derivative of the mean function in the neighborhood of $\psi_*$ is
$$\dot{\mu}_{\min} = \min_{\psi : \|\psi - \psi_*\|_2 \leq 1}\, \min_{a \in \mathcal{A}} \dot{\mu}(a; \psi)\,,$$
where $\dot{\mu}(a; \psi)$ denotes the derivative of $\mu(a^\top \psi)$ with respect to $a^\top \psi$. The maximum derivative $\dot{\mu}_{\max}$ is defined analogously, with the minima replaced by maxima. The mean function in the logistic model is a sigmoid. Therefore, its maximum derivative is $\dot{\mu}_{\max} = 1/4$.
THEOREM 1. Let $\tau$ be a round such that condition (5) holds with probability at least $1 - 1/n$. Then the $n$-round regret of CompUCB is $\tilde{O}(d\sqrt{n})$, up to an additive $O(\tau)$ term for the initial rounds.

PROOF. The claim is proved in Appendix A. □
Our regret bound has two key terms. The first term is the regret for learning the compliance model. It is $\tilde{O}(\sqrt{n})$, since $\alpha_n = \tilde{O}(\sqrt{d})$, and similar to the regret bounds in logistic bandits [14,22,29]. The second term is the regret for learning the reward model. It is $\tilde{O}(\sqrt{n})$, since the corresponding reward confidence width is also $\tilde{O}(\sqrt{d})$, and similar to the regret bounds in linear bandits [1,10]. Therefore, as expected, our regret bound is the sum of the regrets of the two learned models. Finally, we want to comment on the technical assumption in (5). The assumption is borrowed from prior works [22,29] and can be satisfied as follows. Let $\lambda$ denote the right-hand side of (5). Then the assumption holds after $\tau = O(\lambda \log n)$ rounds, when the action sets $\mathcal{A}_t$ are sufficiently diverse.
We would like to discuss our proof next. The most novel part is the regret decomposition in the first half, where we decompose the regret into the regrets of the logistic and linear models. After that, we apply the concentration bounds of Abbasi-Yadkori et al. [1] and Kveton et al. [22], and finish the proof with the elliptical potential lemma. Our regret decomposition generalizes those in contextual cascading bandits [42] to logistic models. Our attempts to prove a similar bound for CompTS failed. Such bounds are hard to prove in partial-observation problems. For instance, Cheung et al. [8] proved a frequentist regret bound for Thompson sampling in a non-contextual cascading bandit with a huge constant involving a factor of 8064 (Lemma 4.3 and the last equation in Section 4 therein). Finally, the strength of our proof is its modularity, because the concentration arguments can be easily replaced. As an example, we believe that the bound of Kveton et al. [22] could be replaced with that of Faury et al. [13] to improve the dependence on the minimum derivative $\dot{\mu}_{\min}$, which is $\dot{\mu}_{\min}^{-2}$ now.

EXPERIMENTS
We conduct four experiments. In Section 5.2, we experiment with various compliance types. In Section 5.3, we evaluate the robustness of CompTS and CompUCB to model misspecification. In Section 5.4, we study the scalability of our algorithms. Finally, we experiment with a real-world problem in Section 5.5.
Our algorithms, CompTS and CompUCB, are implemented with linear reward and logistic compliance models. In all experiments, the actions are overwritten as follows. If the proposed action $A_t$ does not comply ($Z_t = 0$), the taken action $\tilde{A}_t$ is a random compliant action. If no action complies, $\tilde{A}_t$ is a random action.
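This overwriting rule can be simulated as in the sketch below; the function name and array layout are illustrative.

```python
import numpy as np

def overwrite(proposed_idx, complies, rng):
    """Replace a non-compliant proposed action by a random compliant one.

    complies: boolean array marking which actions in the round's action set comply.
    If no action complies, a uniformly random action is taken instead.
    """
    if complies[proposed_idx]:
        return proposed_idx
    compliant = np.flatnonzero(complies)
    if len(compliant) > 0:
        return int(rng.choice(compliant))
    return int(rng.integers(len(complies)))
```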

Baselines
We have three baselines in all experiments: LinTS [4], LinUCB [1], and EnsembleNN [32]. Both LinTS and LinUCB are examples of classic bandit algorithms (Section 2.1). They have the same reward model as CompTS and CompUCB, and illustrate that exploration fails when the compliance of actions is not modeled, even if the reward model is correctly specified.
EnsembleNN is an example of a single-model algorithm (Section 2.1). It learns a complex non-linear model of the mean reward of the proposed action, instead of modeling the reward and compliance separately, as in CompTS and CompUCB. EnsembleNN explores using Thompson sampling, where the true posterior distribution is approximated by an ensemble of neural networks. Each network has 2 fully-connected layers with ReLU activation functions. In each round $t$, one network from the ensemble is chosen uniformly at random. The action that maximizes the mean reward under the chosen network is taken and all networks in the ensemble are updated using its observed reward. The performance of EnsembleNN is sensitive to its hyper-parameters. Therefore, we tune them. The final tuned values are ensemble size 30, mini-batch window size 32, learning rate 0.2, hidden layer size 100, prior variance 0.01, and perturbation variance 0.05.
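A sketch of the per-round EnsembleNN selection step is below; the architecture follows the description above, but the training loop is omitted and all names are illustrative.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Two fully-connected layers with ReLU, as in the EnsembleNN baseline."""
    def __init__(self, d, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ensemble_nn_choice(actions, ensemble, rng):
    """Pick one ensemble member uniformly at random and act greedily with respect to it."""
    net = ensemble[int(rng.integers(len(ensemble)))]
    with torch.no_grad():
        scores = net(torch.as_tensor(actions, dtype=torch.float32))
    return int(torch.argmax(scores).item())
```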
After the tuning, EnsembleNN performs well in all experiments and shows the versatility of complex models. In comparison, CompTS and CompUCB do not need any tuning and are also less computationally costly. Tuning is arguably hard in the online setting, since the problem instance is not known in advance. In terms of computation time, EnsembleNN is 3 times more computationally costly than CompTS and CompUCB in Section 5.2 (750 versus 250 seconds), and 10 times more costly than LinTS and LinUCB (75 seconds). We note that further optimization of all algorithms is possible.

Compliance Type
We start our experiments with a synthetic problem. The number of features is $d = 10$ and the number of actions is $K = 50$. The reward parameter is sampled i.i.d. from $\mathcal{N}(0_d, I_d)$. The feature vectors of the actions are sampled uniformly at random from $[-1, 1]^d$. The reward noise is Gaussian $\mathcal{N}(0, 0.5^2)$. All simulations are averaged over 100 independent runs. Our results are reported in Figure 1.

In Figure 1b, we experiment with a stochastic compliance model $Z_t \sim \mathrm{Ber}(\mu(A_t^\top \psi_*))$, where the compliance parameter $\psi_*$ is sampled i.i.d. from $\mathcal{N}(0_d, I_d)$ in each run. This is the same compliance model as in our algorithm design (Section 3) and analysis (Section 4). We observe that learning the compliance model is very beneficial. Specifically, both LinTS and LinUCB do not model compliance and have linear regret. Our algorithms also outperform EnsembleNN, which does try to model the structure of our problem.
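For reference, a sketch of this synthetic problem generator is below, following the sampling described above; the function name and the use of a NumPy generator are assumptions.

```python
import numpy as np

def make_synthetic_problem(d=10, K=50, rng=None):
    """Synthetic instance: Gaussian parameters and uniform action features."""
    if rng is None:
        rng = np.random.default_rng()
    theta_star = rng.normal(0.0, 1.0, size=d)      # reward parameter ~ N(0_d, I_d)
    psi_star = rng.normal(0.0, 1.0, size=d)        # compliance parameter ~ N(0_d, I_d)
    actions = rng.uniform(-1.0, 1.0, size=(K, d))  # action features ~ U([-1, 1]^d)
    return theta_star, psi_star, actions
```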
In Figure 1c, we experiment with deterministic compliance $Z_t = \mathbb{1}\{A_{t,1} \geq 0.5\}$, where $A_{t,1}$ is the first entry of the action feature vector $A_t$. The challenge of this setting is that our model is misspecified, as the actions either always or never comply. Despite this, CompTS and CompUCB learn near-optimal actions, as evidenced by their sublinear regret. In contrast, both LinTS and LinUCB have linear regret. Our algorithms also outperform EnsembleNN.
Finally, in Figure 1a, we experiment with full compliance, $Z_t = 1$. This setting validates that CompTS and CompUCB are implemented correctly. In particular, when all actions comply, CompTS and CompUCB reduce to LinTS and LinUCB, respectively. We also observe that EnsembleNN performs the worst. This is not surprising, since the reward function in this experiment is linear but EnsembleNN tries to approximate it using a 2-layer neural network.

Model Misspecification
Contextual bandit algorithms are known to be sensitive to model misspecification. In this section, we study this topic. In all experiments, we use the stochastic compliance model from Section 5.2 and modify the compliance model features in CompTS and CompUCB as follows. In feature overspecification, we add 5 additional compliance features that are sampled uniformly at random from $[-1, 1]$. In this case, CompTS and CompUCB have to learn that the additional features are irrelevant, which decreases their statistical efficiency. In proper feature specification, the compliance features remain unchanged. Finally, in feature underspecification, we randomly remove 2 compliance features out of 10. The challenge of this setting is that the compliance model is misspecified.

Our results are reported in Figure 2. We observe three major trends. First, both CompTS and CompUCB perform well when the compliance features are correctly specified (Figure 2b). Second, when the compliance features are overspecified (Figure 2a), the statistical efficiency of CompTS and CompUCB decreases, which results in a slightly higher regret. Nevertheless, both algorithms still outperform all baselines. Finally, when the compliance features are underspecified, CompTS and CompUCB can have linear regret. In this case, it is most beneficial to learn a more complex model using EnsembleNN. Surprisingly, even learning an incorrect model of compliance, by CompTS and CompUCB, is more beneficial than not learning it at all, by LinTS and LinUCB.

Scalability
Now we examine how the regret scales with the number of features $d$ and the number of actions $K$. All experiments in this section are variants of the stochastic compliance setting in Section 5.2.
Our results are reported in Figure 3. We observe that both CompTS and CompUCB have sublinear regret in all plots, and outperform the classic bandit algorithms, whose regret is always linear. For a fixed $d$, the regret of CompTS and CompUCB increases slowly with $K$. This suggests that the proposed algorithms can scale to large action sets. For $d = 25$, CompUCB initially has a slightly higher regret than LinTS and LinUCB. As the number of rounds increases, CompUCB quickly learns a good policy. The best performing algorithm is CompTS. Its performance gap over the other algorithms grows with the number of features $d$.

MovieLens Experiments
The goal of these experiments is to showcase CompTS and CompUCB beyond synthetic problems. We experiment with the MovieLens 1M dataset [24], which contains 1 million ratings from 6,040 users for 3,883 movies. The dataset is used as follows. We complete the sparse rating matrix $M$ using alternating least squares [11] with rank 5. This rank is high enough to yield a low prediction error. The learned factorization is $\hat{M} = U V^\top$. The $i$-th row of $U$, denoted by $u_i$, is the latent factor representing user $i$. The $j$-th row of $V$, denoted by $v_j$, is the latent factor representing movie $j$.
Our results are averaged over 100 random runs. In each run, we simulate interactions of a recommender system with randomly arriving users over $n$ rounds. In round $t \in [n]$, we first choose a random user $i$. Then we select 100 random movies as the action set $\mathcal{A}_t$ in round $t$. The feature vector of movie $j$ recommended to user $i$ is $\mathrm{vec}(u_i v_j^\top)$, where $\mathrm{vec}(\cdot)$ denotes the vectorization of a matrix. Since both $u_i$ and $v_j$ are 5-dimensional vectors, $\mathrm{vec}(u_i v_j^\top)$ has 25 dimensions. The mean reward for recommending movie $j$ to user $i$ is $u_i^\top v_j$ and the agent observes it with Gaussian noise $\mathcal{N}(0, 0.5^2)$.

Compliance type. We study the same three compliance types as in Section 5.2. Our results are reported in Figure 4. We observe that all trends are similar to Figure 1. One exception is that CompUCB is less statistically efficient, which manifests as a higher regret relative to the other algorithms. This is because the number of features $d$ increased. The regret remains sublinear though.
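The feature construction for this experiment can be written compactly as in the sketch below, assuming the rank-5 user factors $u_i$ and movie factors $v_j$ from the completed rating matrix; the function name is illustrative.

```python
import numpy as np

def movie_features(u_i, V_candidates):
    """Features vec(u_i v_j^T) for each candidate movie j.

    u_i: (5,) user latent factor; V_candidates: (K, 5) latent factors of the candidate movies.
    Returns a (K, 25) feature matrix; the mean reward of movie j is u_i^T v_j.
    """
    return np.einsum("r,ks->krs", u_i, V_candidates).reshape(len(V_candidates), -1)
```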
Realistic compliance rules. We experiment with three realistic scenarios. Since the MovieLens dataset does not contain compliance feedback, we generate it based on user and movie features. The first non-compliance scenario is age-restricted compliance. Initially, we wanted to block horror and crime movies for young users. Unfortunately, these users comprise only about 3% of the MovieLens dataset. Therefore, we decided to suppress children's movies for users above age 18. The other two non-compliance scenarios are motivated by pop-up messages on mobile devices, which are supposed to increase user engagement. In the first scenario, we measure user activity by the number of days since the user last left a rating, within a 30-day window. We predict the activity using a linear model of user features. A recommendation is suppressed if the predicted activity is at least 2 days. We call this model linear compliance. In the other scenario, we measure user activity using lapse, which takes value 0 if the user is active in the subsequent 30 days, and 1 otherwise. We predict the lapse using a logistic model of user features. A recommendation is suppressed if the predicted lapse is lower than 0.5. We call this model logistic compliance. If the suppression were based on user features only, we would either recommend everything or nothing for a given user. Therefore, we limit it to movies $j$ whose first entry of $v_j$ is greater than 0.5, similarly to our experiments in Figures 1c and 4c. The thresholds are chosen so that about 50% of recommendations are suppressed on average.
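The three suppression rules can be expressed as simple checks, as in the sketch below. The attribute names (user_age, movie_is_childrens) and the fitted weight vectors are hypothetical stand-ins for quantities derived from the MovieLens metadata; only the thresholds follow the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def is_suppressed(rule, user_age, user_x, movie_v, movie_is_childrens,
                  w_activity=None, w_lapse=None):
    """Illustrative versions of the three suppression rules (hypothetical attribute names)."""
    if rule == "age":
        # Children's movies are suppressed for users above age 18.
        return user_age > 18 and movie_is_childrens
    # The activity- and lapse-based rules apply only to movies with a large first latent factor.
    if movie_v[0] <= 0.5:
        return False
    if rule == "linear":
        return user_x @ w_activity >= 2.0          # predicted days since the last rating
    if rule == "logistic":
        return sigmoid(user_x @ w_lapse) < 0.5     # predicted lapse probability
    return False
```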
Our results are reported in Figure 5. In all experiments, CompTS and CompUCB perform well, and CompTS is the best algorithm. Although CompUCB has a higher regret than LinTS in Figure 5a, its regret is never linear. LinTS and LinUCB have linear regret in Figures 5b and 5c. Overall, we observe that CompTS and CompUCB perform robustly under all tested compliance models.

CONCLUSIONS
We study non-compliant bandits, where the actions proposed by an agent may be overwritten by a downstream algorithm. We propose UCB and Thompson sampling algorithms for this setting, which model the reward and compliance separately. The algorithms are contextual and efficient. We prove a $\tilde{O}(d\sqrt{n})$ regret bound for CompUCB with $d$ features over $n$ rounds. Our empirical results show that CompTS and CompUCB consistently outperform their classic counterparts, LinTS and LinUCB, on both synthetic and real-world problems. In all but one experiment, CompTS outperforms EnsembleNN, which learns an expressive neural network reward model instead of modeling the structure of our problem. This is despite the fact that EnsembleNN was tuned, which would be impossible in a true online setting.
Non-compliance is understudied in bandits and we hope that this paper will encourage more work on this important practical topic. Our work can be readily extended in four directions. First, we assume that the rewards of taken actions are always observed. Our algorithm design and analysis can be generalized to a slightly more general setting, where the reward is only observed when the proposed action complies. Second, while we show empirically that CompTS performs well, we do not bound its regret. We discuss the technical challenges that prevented such an analysis after Theorem 1. Third, similarly to classic bandit algorithms, model misspecification can be an issue, and additional work is needed to improve the robustness of CompTS and CompUCB. Finally, our work can be extended to other popular bandit models for recommender systems, such as combinatorial semi-bandits [7,15,21], online learning to rank [9,18,20,26,34], and collaborative filtering and low-rank bandits [19,30,37,41].

A PROOF OF THEOREM 1
To simplify notation, we define $f(a) = p(a; \psi_*)\, r(a; \theta_*)$. We start with trivial upper bounds on the regret in the first $\tau$ rounds and on the $n$-round regret when (5) does not hold in round $\tau$. If (5) holds in round $\tau$, we have $P(\|\hat{\psi}_t - \psi_*\|_2 > 1) \leq 1/n$ for any round $t \geq \tau$ [22]. Therefore, we can further bound the regret on the event $E_t = \{\|\hat{\psi}_t - \psi_*\|_2 \leq 1\}$. To simplify notation, we do not write $\mathbb{1}\{E_t\}$ in the rest of the analysis.
Let $H_t$ be the history of all interactions of the agent up to round $t$. Let $U_{r,t}(a)$ and $L_{r,t}(a)$ be high-probability upper and lower confidence bounds, respectively, on the mean reward of action $a$ in round $t$. Let $U_{p,t}(a)$ and $L_{p,t}(a)$ be the corresponding quantities for the compliance probability of action $a$ in round $t$. Specifically, let
$$L_{r,t}(a) \leq r(a; \theta_*) \leq U_{r,t}(a)\,, \qquad L_{p,t}(a) \leq p(a; \psi_*) \leq U_{p,t}(a)$$
hold jointly over all actions $a$ and rounds $t$ with probability at least $1 - 2\delta$. The reward confidence intervals hold jointly with probability at least $1 - \delta$ by the self-normalized bound of Abbasi-Yadkori et al. [1], where $\|\theta_*\|_2 \leq S$; in our case, $S = 1/2$. The definition of $\Sigma_{p,t}$ then leads to an upper confidence bound on the compliance probability. This completes the proof. □

Figure 1: Evaluation of CompTS and CompUCB on three compliance models. From left to right, we experiment with (a) full compliance, (b) stochastic compliance, and (c) deterministic compliance.

Figure 3: Evaluation of CompTS and CompUCB on 9 synthetic problems of different sizes. We vary the number of features, from $d = 5$ to $d = 25$, and the size of the action set, from $K = 20$ to $K = 100$.

Figure 4: Evaluation of CompTS and CompUCB on three compliance models in the MovieLens dataset. From left to right, we experiment with (a) full compliance, (b) stochastic compliance, and (c) deterministic compliance.