Open Access

Stochastic Top K-Subset Bandits with Linear Space and Non-Linear Feedback with Applications to Social Influence Maximization

Published: 03 February 2022


Abstract

There are numerous real-world problems where a user must make decisions under uncertainty. For the problem of influence maximization on a social network, for example, the user must select a set of K influencers who will jointly have a large influence on many users. With the lack of prior knowledge about the diffusion process or even topological information, this problem becomes quite challenging. This problem can be cast as a combinatorial bandit problem, where the user can repeatedly choose a candidate set of K out of N arms at each time, with an aim to achieve an efficient trade-off between exploration and exploitation.

In this work, we present the first combinatorial bandit algorithm for which the only feedback is a non-linear reward of the selected K arms. No other feedback is needed. In the context of influence maximization, this means no feedback in the form of which nodes or edges were activated needs to be available, just the amount of influence. The novel algorithm we propose, CMAB-SM, is based on a divide-and-conquer strategy. It is computationally and storage efficient. Over a time horizon T, the proposed algorithm achieves a regret bound of Õ(K1/2N1/3T2/3). This bound is sub-linear in all of the parameters: T, N, and K.

We empirically demonstrate our algorithm’s performance using the applications of influence maximization and product cross-selling. For influence maximization, we provide experiments on real-world social networks, showing that the proposed CMAB algorithm outperforms bandit-specific and social-influence-domain-specific algorithms in terms of empirical run-time and expected influence. For product cross-selling, we also demonstrate that the proposed CMAB algorithm outperforms considered baselines on synthetic data.


1 INTRODUCTION

There are numerous settings where agents make decisions sequentially within an unknown, stochastic environment, and each decision involves choosing a subset of options. Two important examples are social influence maximization and product cross-selling.

In lieu of traditional media advertising, a company may repeatedly select a set of social media “influencers” to sponsor, wanting those influencers to inspire their followers, and followers of those followers, to discuss and post about the product. The problem of sequentially selecting a set of influencers to sponsor, known as social influence maximization [15], can be challenging. The dynamics of how influence spreads among social network users may be complex and difficult to accurately model, with privacy settings limiting visibility, unknown user preferences, overlapping audiences, and interactions between multiple social networks.

Researchers have designed algorithms for social influence maximization based on algorithms for combinatorial multi-armed bandits with semi-bandit feedback. Those algorithms assume knowledge of the network topology [20, 30], influence propagation traces [20], node-level feedback [45], edge-level feedback [40], etc. However, such feedback may be unavailable due to privacy settings. We consider the case when such feedback is not available and model the problem as a combinatorial multi-armed bandit with full-bandit feedback, i.e. no knowledge of the underlying social network or diffusion process except the number of posts.

The problem of maximal-profit item selection with cross-selling [42] is defined as follows: “if selling items together is more profitable than selling them individually, then which two or more items should be sold together?” The problem is to select, in an online manner, the subset of items that yields the maximal profit when cross-selling is taken into account.

Other motivating applications include erasure-coded storage where \(K\) out of \(N\) servers are picked to serve a content request [43], and daily advertising campaigns involving a set of \(K\) sub-campaigns [36, 44].

All the applications discussed above can be modeled as combinatorial multi-armed bandit (CMAB) problems. There are \(N\) options or “arms” available at each time. At each time, the agent can choose up to \(K\) arms and will receive a reward depending on the chosen subset of arms. The goal of the agent is to obtain as much reward as possible over the time horizon. As in the traditional MAB problem, where the agent can only select \(K=1\) arm, the agent may need to balance “exploration,” that is picking a subset of \(K\) arms to learn about the subset’s reward distribution, and “exploitation,” that is picking a subset of \(K\) arms that currently seems the best, despite the agent having uncertainty about each subset.

Unlike the traditional setting, however, when the agent can choose any subset of \(K\gt 1\) arms out of all \(\binom{N}{K}\) such subsets, it may be impossible to even try each subset once. Furthermore, when the subset’s reward is an unknown, non-linear function of the individual arms’ rewards, it may be impossible to infer each arm’s individual performance. Previous combinatorial MAB algorithms that assume a linear function or require additional feedback in the form of each individual arm’s performance [11, 35] will not work in this setting.
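To make this setting concrete, here is a minimal sketch of the full-bandit interaction loop, under the illustrative assumptions of Bernoulli arm rewards and \(f = \max\) (the class name `CMABEnv` is ours, not the paper's): the agent observes only the scalar value of \(f\), never the individual arm draws.

```python
import random

# Minimal sketch of a full-bandit CMAB environment (assumptions: Bernoulli
# arm rewards and f = max; neither is fixed by the paper's general setup).
class CMABEnv:
    def __init__(self, probs, f=max):
        self.probs = probs   # unknown Bernoulli means, one per arm
        self.f = f           # non-linear joint reward function

    def play(self, action):
        """Play a tuple of distinct arm indices; only f(...) is observed."""
        draws = [1 if random.random() < self.probs[i] else 0 for i in action]
        return self.f(draws)  # full-bandit feedback: a single scalar

env = CMABEnv([0.9, 0.2, 0.5, 0.7])
reward = env.play((0, 2, 3))  # choose K = 3 of N = 4 arms
assert reward in (0, 1)
```

Because only this aggregate scalar is returned, the per-arm draws that semi-bandit algorithms rely on are unavailable by construction.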

In this paper, we propose an efficient algorithm, CMAB-SM, to solve the CMAB problem where the reward is a non-linear function of individual arm rewards and no additional feedback is available. To our knowledge, this is the first paper to do so. CMAB-SM achieves a regret bound of \(\widetilde{O}({K^\frac{1}{2}N^\frac{1}{3}T^\frac{2}{3}})\). Our proposed algorithm only requires \(O(N)\) storage complexity and \(\tilde{O}(K)\) per-round time complexity.1

1.1 Our Contributions

We now summarize the main contributions of this paper:

(1)

We propose CMAB-SM, the first efficient algorithm for the CMAB problem with non-linear “full-bandit” feedback (i.e. no extra feedback for individual arms).

(2)

We prove that over a horizon \(T\), CMAB-SM achieves a regret of \(\widetilde{O}({K^\frac{1}{2}N^\frac{1}{3}T^\frac{2}{3}})\), which is sub-linear with respect to each parameter.

(3)

We prove that CMAB-SM has \(O(N)\) space complexity and \(O(TK\log K)\) time complexity.

(4)

We evaluate CMAB-SM’s performance empirically in multiple problems, including influence maximization. CMAB-SM outperforms the \(\epsilon\)-CD algorithm (a domain-specific algorithm based on the credit distribution model [20]).

(5)

Lastly, the design and analysis of CMAB-SM use the theory of stochastic dominance to order the arms without needing to estimate individual arm distributions. This technique obviates the need for additional feedback and may be of independent interest for MAB problems.

1.2 Key Techniques

We now summarize the method and proof techniques used. CMAB-SM divides all \(N\) arms into groups of \(K+1\) arms. Each group offers only \(\binom{K+1}{K} = K+1\) possible actions, and since choosing \(K\) out of \(K+1\) arms is equivalent to removing 1 arm from the \(K+1\), there is a one-to-one mapping between arms and actions in a group. We sort the actions (and the corresponding arms) in each group, which requires a number of time steps polynomial in \(K\). We then merge the groups one by one to obtain the best \(K\) arms.

For the analysis, we assume certain properties on the reward distributions of the different arms and on the function of the rewards of the individual arms. More precisely, we use the theory of stochastic dominance to compare cumulative distribution functions: the reward distributions of any two arms are assumed to be ordered, one dominating the other. Further, the non-linear function is assumed to be symmetric in the rewards obtained from each arm, and the mean of the function is assumed to be continuous (in terms of dominated inputs). These properties are satisfied by several reward distributions (e.g., Bernoulli rewards) and several non-linear functions (e.g., the maximum).

1.3 Related Work

Combinatorial bandits have been studied where the agent chooses \(K\) of the \(N\) arms in each round [1, 5, 10, 12, 13, 14]. In these works, it is assumed that the reward function in each round is linear in the different arms. They consider a setting where at time \(t\), the agent selects an arm \(x_t\in D_t\) and observes a reward \(\theta ^Tx_t\), where \(D_t\subset \mathbb {R}^K\) is the decision set and \(\theta \in \mathbb {R}^K\) is a constant vector. Due to this linear function, the problem is also called online linear optimization. The algorithms proposed in these works use the linearity of the reward function to estimate rewards of individual arms and achieve a regret of \(O(\sqrt {T})\). Weights are then assigned to each of the \(\binom{N}{K}\) actions to decide the action in the next round; such approaches are not computationally efficient for large \(N\), as \(|D_t| = \binom{N}{K}\). The work of [34] reduces the space complexity at the cost of the regret bound; however, it still loops over all the arms, and the regret bound becomes exponential in \(K\). For a linear function, such as the sum of rewards, each action can be represented as a binary \(K\)-sparse vector of length \(N\). Such a setting has \(N\) unknown variables, which could be obtained using least squares as done by [12], or regularized least squares [1]. We consider non-linear functions for which the expected rewards of individual arms cannot be obtained using a least squares solution.

[16, 22, 33] studied the problem of generalized linear models (GLM), where the reward \(r_t\) is a function \(\left(f(z):\mathbb {R}\rightarrow \mathbb {R}\right)\) of \(z=\theta ^T x_t\). Generalized linear models assume that the distribution of the bandit reward \(r_t\) belongs to a canonical exponential family. The exponential-family form allows the use of log-likelihood maximization to obtain estimates of arm parameters which increase the likelihood of the observed rewards. Generalized linear models assume that the expected reward of the arm played is a non-linear function of a linear combination of the features of the action played with a fixed parameter. This is different from our setup, where we assume the reward of the action played is a non-linear function of the individual realizations of the rewards of each arm.

GLM (and linear) models have a long and rich history across many disciplines, such as finding a target item among multiple options [29]. GLMs also have many interesting theoretical and statistical properties. But there are settings where GLMs do not accurately model rewards. For example, when multiple arms are selected and the joint reward is the maximum of the individual rewards, the joint reward is not a linear combination of the individual arm rewards.

[27] provides a UCB-style algorithm for matroid bandits, where the agent selects a maximal independent set of rank \(K\) to maximize the sum of the rewards of the individual arms, assuming the reward of each of the \(K\) arms is also observed in each round. Such a setup, where the rewards of each of the \(K\) arms are also available to the agent, is referred to as a semi-bandit problem. [17] also considered semi-bandits for the problem of maximum weighted matching in cognitive radio applications. [28] showed that the UCB algorithm provides a tight regret bound for the semi-bandit combinatorial bandit problem with a linear reward function. The authors of [11] considered the combinatorial semi-bandit problem with non-linear rewards using a UCB-style analysis. The authors of [35] considered the combinatorial bandit problem with a non-linear reward function and feedback that is a linear combination of the rewards of the \(K\) arms; such linear feedback allows for the recovery of the individual rewards. In contrast to prior works, this paper assumes neither the availability of individual arm rewards nor linear feedback. With only aggregate, non-linear feedback, it might not be possible to obtain the exact values of the rewards of the base arms.

1.4 Organization

The rest of the paper is organized as follows. In Section 2, we provide the model under consideration and the assumptions that are taken for the analysis. Section 3 presents the proposed algorithm, CMAB-SM. The main result is provided in Section 4. Section 5 illustrates our results on a synthetic example for the cross-selling application. Further evaluations on synthetic problems are provided in the Appendix. In Section 6, we present the results for the application to influence maximization and compare the proposed approach with some standard domain-specific methods. Section 7 presents the conclusions with directions for future work.


2 PROBLEM FORMULATION AND ASSUMPTIONS

2.1 Problem Setup

We now describe the stochastic combinatorial multi-armed bandit problem we consider. There are \(N\) “arms” labeled as \(i \in [N] = \lbrace 1, 2, \ldots , N\rbrace\). Each time an arm is chosen or “played,” there is a reward. Let \(X_{i,t} \in [0,1]\) be a random variable denoting the reward of the \(i^{th}\) arm, at time-step \(t\) (also referred to as time \(t\)). We assume that \(X_{i,t}\) are independent across time and arms, and for any arm the distribution is identical at all times. Also, if some analysis is independent of the time variable, we will drop the subscript \(t\) and write only \(X_i\). The rewards could be discrete valued, continuous valued, or mixed.

At each time instant, the agent chooses an action \({\bf a} = (a_1, a_2,\ldots , a_K)\) which is a \(K\)-tuple of arms. Let \(\mathcal {A} = \lbrace {\bf a}\in [N]^K \ | \ {\bf a}(i) \ne {\bf a}(j)\ \forall \ i,j:\ 1 \le i\lt j\le K\rbrace\) be the set of all such actions which can be constructed using \(N\) arms. Thus the cardinality of \(\mathcal {A}\) is \(\binom{N}{K}\). We denote the action played at time \(t\) as \({\bf a}_t \in \mathcal {A}\). For an action \({\bf a}=(a_1, a_2, \ldots , a_K)\) let \({\bf d}_{{\bf a}, t} = (X_{a_1,t}, X_{a_2,t}, \dots , X_{a_K,t})\in [0,1]^K\) denote the column vector of arm rewards at time \(t\) from action \({\bf a}\). The reward \(r_{\bf a}(t)\) of action \({\bf a}\) at time \(t\) is a bounded function \(f:[0,1]^K\rightarrow [0,1]\) of the rewards from the arms chosen in that action, \[\begin{align*} r_{\bf a}(t) &= f({\bf d}_{{\bf a}, t}). \end{align*}\]
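As a concrete instance of this reward model, suppose (an illustrative assumption, not the only setting covered by the formulation) that the arm rewards are independent Bernoulli with means \(p_{a_i}\) and \(f = \max\); then \(\mu _{\bf a} = \mathbb {E}[f({\bf d}_{\bf a})] = 1 - \prod _i (1 - p_{a_i})\), which a quick Monte Carlo estimate confirms:

```python
import random

# Sketch: for f = max and independent Bernoulli arms (an assumption, not
# the paper's only setting), mu_a = E[max_i X_{a_i}] = 1 - prod_i (1 - p_i).
def mu_max_bernoulli(ps):
    prod = 1.0
    for p in ps:
        prod *= 1.0 - p
    return 1.0 - prod

def mu_monte_carlo(ps, n=200_000, seed=0):
    """Estimate E[f(d_a)] by repeatedly sampling the arm draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        total += max(1 if rng.random() < p else 0 for p in ps)
    return total / n

ps = (0.3, 0.6)
assert abs(mu_max_bernoulli(ps) - mu_monte_carlo(ps)) < 0.01  # both near 0.72
```

Note that \(\mathbb {E}[\max _i X_{a_i}]\) differs from \(\max _i \mathbb {E}[X_{a_i}]\) (here \(0.72\) versus \(0.6\)), which is exactly why the non-linear case does not reduce to a function of expected rewards.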

Later in the text, we will skip index \(t\) for brevity, where it is unambiguous. If at time \(t\), action \({\bf a}_t\) is played, \({\bf d}_{{\bf a}_t, t}\) will be simplified to \({\bf d}_{{\bf a}_t}\). Also, if we analyze behavior for action \({\bf a}\) such that its reward vector is independent and identically distributed across time, we will drop the subscript and use only \({\bf d}_{\bf a}\) for brevity.

In many practical systems, the real reward is a non-linear function of noisy rewards rather than a non-linear function of expected rewards plus noise. For example, consider a distributed system in which a job may be forked into multiple parallel tasks. The completion time of the job depends on the maximum time taken to finish any of the tasks; hence, the completion time of the job is a non-linear function of the completion times of the sub-tasks. Another example of a non-linear function of individual rewards is item selection in cross-selling. In cross-selling, the total profit of the seller is the sum of the individual profits plus an additional advantage made from the combined transaction. The additional advantage can be modelled as a quadratic term in the individual items sold [42].

For a linear case, the two cases are equivalent as (1) \[\begin{align} r_{\bf a}(t) = f({\bf d}_{{\bf a}_t}) = \theta ^T{\bf d}_{{\bf a}_t} = \theta ^T(\mathbb {E}\left[{\bf d}_{{\bf a}_t}\right]+\eta _t) = \theta ^T\mathbb {E}\left[{\bf d}_{{\bf a}_t}\right]+\theta ^T\eta _t = f\left(\mathbb {E}\left[{\bf d}_{{\bf a}_t}\right]\right) + \epsilon _t \end{align}\] where \(\epsilon _t\), and \(\eta _t\) are noise terms.

For a non-linear case, this formulation does not simply reduce to a function of expected rewards plus noise. However, we can still write the reward as the expected bandit reward plus noise \(\epsilon _t\), or (2) \[\begin{align} r_t &= \mathbb {E}[f({\bf d}_{{\bf a}_t})] + \epsilon _t \end{align}\] Similar to the work of [6], we aim to maximize the expected reward for the actions selected over time. Further, we intend to improve the finite-time regret bounds for the same. We denote the expected reward of any action \({\bf a} \in \mathcal {A}\) as \(\mu _{\bf a}\), or \(\mu _{\bf a} = \mathbb {E}[r_{{\bf a}}(t)]\) for all \(t\in \lbrace 1, 2, \ldots , T\rbrace\).

We assume that there is an action \({\bf a}^*\) for which the expected reward \(\mu _{ {\bf a}^*}\) is highest among all actions \({\bf a} \in \mathcal {A}\), \[\begin{align*} {\bf a}^* &= \arg \max _{{\bf a}\in \mathcal {A}}\mu _{\bf a} \end{align*}\] We refer to this action \({\bf a}^*\) as “optimal.” Given an optimal action, regret for an action at time \(t\) can be defined as follows.

Definition 2.1

(Regret).

The regret of an action \({\bf a}_t\) at time \(t\) is defined as the difference between the reward obtained by the optimal action and the reward obtained by \({\bf a}_t\), or (3) \[\begin{eqnarray} R(t) = r_{{\bf a}^*}(t) - r_{{\bf a}_t}(t) \end{eqnarray}\]

The objective is to minimize the expected regret accumulated during the entire time horizon, (4) \[\begin{align} W(T)= \mathbb {E}_{{\bf a}_1, r_{{\bf a}_1}(1), \ldots , {\bf a}_T, r_{{\bf a}_T}(T) }\left[\sum ^{T}_{t=1}R(t)\right] \end{align}\]

2.2 Assumptions

We now discuss the technical assumptions which are required to prove the regret bounds for the CMAB-SM algorithm. We note that many of the assumptions are not required for the algorithm to work, only to prove the guarantees.

The function \(f\) is assumed to be symmetric, so that the ordering within the tuple does not matter. In other words, the reward for an action is symmetric in its constituent arms. This assumption is true for certain problem settings where the ordering among the individual arms is not important, such as the maximum of the rewards, or the sum of the rewards of the individual arms. This assumption is given as follows.

Assumption 1 (Symmetry).

\(f\) is a symmetric function of the rewards obtained by the constituent arms. More precisely, let \(\Pi ({\bf d})\) be an arbitrary permutation of \({\bf d}\). Then, the reward observed will be identical for both \(\Pi ({\bf d})\) and \({\bf d}\), or (5) \[\begin{eqnarray} f\left({\bf d}\right) = f\left(\Pi \left({\bf d}\right)\right) \end{eqnarray}\]

In the rest of the text, we denote by \(\Pi (\cdot)\) a permutation, where \(\Pi ({\bf y})\) is one of the possible permutations of the vector \({\bf y}\). We now define the gap \(\Delta\) between two actions as follows.

Definition 2.2

(Gap).

The Gap \(\Delta _{{\bf a}_1, {\bf a}_2}\) between any two actions \({\bf a}_1, {\bf a}_2 \in \mathcal {A}\) is defined as the difference of expected rewards of the actions, or (6) \[\begin{eqnarray} \Delta _{{\bf a}_1, {\bf a}_2} = \mu _{{\bf a}_1} - \mu _{{\bf a}_2} = \mathbb {E}[f({\bf d}_{{\bf a}_1})] - \mathbb {E}[f({\bf d}_{{\bf a}_2})] \end{eqnarray}\]

We assume that there is an optimal action \({\bf a}^*\) for which the expected reward is highest among all actions \({\bf a} \in \mathcal {A}\). We denote the reward of the optimal action by \(\mu _{{\bf a}^*}\). The Gap of an action \({\bf a} \in \mathcal {A}\) with respect to the optimal action is simply written as \(\Delta _{\bf a}\), or, \(\Delta _{\bf a} = \Delta _{{\bf a}^*, {\bf a}}\)

Remark 1.

Using linearity of expectation it can be seen that, \(\mathbb {E}\left[R(t)\right] = \Delta _{{\bf a}_t}\)

From Remark 1, the expected regret accumulated during the entire time horizon can be written as, (7) \[\begin{eqnarray} W(T) = \mathbb {E}\left[\sum ^{T}_{t=1}R(t)\right] = \mathbb {E}\left[\sum ^{T}_{t=1}\Delta _{{\bf a}_t}\right] \end{eqnarray}\] We define the maximum regret of all possible actions as \(R_{max} = \max _{{\bf a}\in \mathcal {A}}{\Delta _{{\bf a}}}\).

We now use the concept of stochastic dominance to order two arms. We assume that a first-order stochastic dominance relation, defined as follows, exists between any two arms.

Definition 2.3

(First-Order Stochastic Dominance).

A random variable \(X\) has first-order stochastic dominance (FSD) over another random variable \(Y\) (or \(X \succ Y\)) if, for every outcome \(x\), \(X\) gives at least as high a probability of receiving at least \(x\) as does \(Y\), and for some \(x\), \(X\) gives a strictly higher probability of receiving at least \(x\), (8) \[\begin{align} X \succ Y \Leftrightarrow \ & P\left(X \ge x\right) \ge P\left(Y \ge x\right) \ \forall x \in \mathbb {R}, \nonumber \\ & P\left(X \ge x\right) \gt P\left(Y \ge x\right) \text{ for some } x \in \mathbb {R}. \end{align}\]
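For discrete distributions, this definition can be checked directly. The sketch below, for Bernoulli arms (a hypothetical example; both conditions of (8) only need checking on the support \(\lbrace 0, 1\rbrace\)), tests the weak inequality everywhere and the strict inequality somewhere:

```python
# Sketch: checking first-order stochastic dominance for two Bernoulli
# random variables (an illustrative, hypothetical example).
def fsd(px, py, support=(0, 1)):
    """True iff X > Y in FSD: P(X >= x) >= P(Y >= x) for all x, strict for some x."""
    def tail(p, x):       # P(Bernoulli(p) >= x)
        if x <= 0:
            return 1.0
        if x <= 1:
            return p
        return 0.0
    ge = all(tail(px, x) >= tail(py, x) for x in support)
    strict = any(tail(px, x) > tail(py, x) for x in support)
    return ge and strict

assert fsd(0.8, 0.3)      # Bernoulli(0.8) dominates Bernoulli(0.3)
assert not fsd(0.3, 0.8)  # and not the other way around
```

For Bernoulli arms, \(X \succ Y\) thus reduces to comparing the success probabilities, which is consistent with the strict-dominance remark below.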

Assumption 2 (FSD between Arms).

There exists a dominance ordering between all the arms, defined using FSD. In other words, for each pair of arms \(i\) and \(j\), either \(X_{i} \succ X_{j}\) or \(X_{j} \succ X_{i}\).

FSD implies second-order stochastic dominance, which indicates that the mean of the dominating random variable is at least the mean of the dominated random variable [7, 21]. This is summarized in the following lemma.

Lemma 2.4 ([7, 21]).

If a random variable \(X\) has FSD over another random variable \(Y\) (or, \(X \succ Y\)), then the expected value of \(X\) is at least the expected value of \(Y\), or (9) \[\begin{eqnarray} \mathbb {E}\left[X\right] \ge \mathbb {E}\left[Y\right] \end{eqnarray}\]

Remark 2.

From Lemma 2.4, we note that if arm \(i\) dominates arm \(j\), then the mean reward for arm \(i\) is strictly greater than that of arm \(j\). Thus, under Assumption 2, (10) \[\begin{align} \mathbb {E}[X_{i}] \ne \mathbb {E}[X_{j}], \forall i, j \in \lbrace 1,\ldots ,N\rbrace \end{align}\]

Such strict dominance exists for Bernoulli and exponential reward distribution functions.

Since we can construct a new action by changing the arms of an existing action, we define the replacement function \(h(\cdot ,\cdot ,\cdot)\) which changes an element \(i\) of a given reward vector \({\bf d}\) (where each entry in the reward vector is a random variable with the distribution of the corresponding arms).

Definition 2.5

(Replacement Function).

The replacement function \(h(\cdot)\) is defined as a function on \(\mathbb {R}^{K+2}\), which replaces the \(i^{th}\) element of vector \({\bf d}\) with \(x\), or \[\begin{eqnarray*} h({\bf d}, i, x) = \left({\bf d}(1),\ \dots ,\ {\bf d}(i-1),\ x,\ {\bf d}(i+1),\ \dots ,\ {\bf d}(K) \right). \end{eqnarray*}\]
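In code, the replacement function of Definition 2.5 is a one-liner (with 1-based indexing, matching the definition):

```python
# The replacement function h(d, i, x) of Definition 2.5: replace the i-th
# element of the reward vector d with x (i is 1-based, as in the text).
def h(d, i, x):
    return d[:i - 1] + (x,) + d[i:]

d = (0.2, 0.5, 0.9)
assert h(d, 2, 0.7) == (0.2, 0.7, 0.9)
```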

For a random variable \(X\), \(h({\bf d}, i, X)\) is also a random variable. We also assume that the expected reward of an action is a strictly increasing function of the rewards obtained by the individual arms.

Assumption 3 (Strictly Increasing).

\(f(\cdot)\) is an element-wise, strictly increasing function of the individual rewards obtained by the constituent arms. More precisely, (11) \[\begin{equation} f\left(h\left({\bf d}, i, x\right)\right)\ \gt f\left(h\left({\bf d}, i, y\right)\right) \ \forall x \gt y \text{ ; } x, y \in [0,1]\ \forall ~{\bf d}\in [0,1]^K \end{equation}\]

Even though we assume a strictly increasing function, the analysis also holds for a strictly decreasing function \(f\) by transforming the reward function as \(f_{n}({\bf d}) = 1-f({\bf d})\). In order to compare the distance between the individual reward vectors of two different actions, we need to find the difference between the two vectors up to a permutation, since the reward function is permutation invariant. With this distance metric in mind, we assume that \(f(\cdot)\) is Lipschitz continuous (in an expected sense), as formally described in the following.

Assumption 4 (Continuity of Expected Rewards).

The expected value of \(f(\cdot)\) is Lipschitz continuous with respect to the expected value of the rewards obtained by the individual arms, meaning (12) \[\begin{align} \big |\ \mathbb {E}\left[f({\bf d}_1)\right] - \mathbb {E}\left[f({\bf d}_2)\right]\big | \ \le \ U_1 \min _{\Pi }{\big |\big |}\mathbb {E}[{\bf d}_1] - \Pi (\mathbb {E}[{\bf d}_2]){\big |\big |_2} \end{align}\] for any given random vectors \({\bf d}_1\) and \({\bf d}_2\) and for some \(U_1\lt \infty\), where \(\Pi\) is minimized over all permutations of \(\lbrace 1, \ldots , K\rbrace\).

Corollary 2.6.

Assumption 4 also implies (13) \[\begin{align} \big |\mathbb {E}\left[f(h\left({\bf d}, i, X\right))\right] - \mathbb {E}\left[f(h\left({\bf d}, i, Y\right))\right]\big | \ \le \ U_1 \big | \mathbb {E}[X] - \mathbb {E}[Y]\big | \end{align}\]

for any given random vector \({\bf d}\) and any \(i \in \lbrace 1, \ldots , K\rbrace\).

We further assume a lower bound in (13) as formally stated in the following assumption.

Assumption 5 (Continuity of Individual Expected Rewards).

We assume that the continuity given in (13) has a similar lower bound. More precisely, there is a \(U_2\lt \infty\) such that (14) \[\begin{align} \big |\mathbb {E}[X] - \mathbb {E}[Y]\big | \le U_2\big |\left(\mathbb {E}\left[f(h\left({\bf d}, i, X\right))\right] - \mathbb {E}\left[f(h\left({\bf d}, i, Y\right))\right]\right)\big | \end{align}\] for any given random vector \({\bf d}\) and any \(i \in \lbrace 1, \ldots , K\rbrace\).

Assumption 5 holds for many well behaved functions in practical scenarios. For example, \(f(\cdot) = \max (\cdot)\) with Bernoulli rewards for individual arms,2 \(f(\cdot)\) = sum of individual rewards, or \(f(\cdot)\) = concave utility function of sum of individual rewards.
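For the max-of-Bernoullis example, the two-sided bound can even be verified in closed form: replacing the mean of position \(i\) from \(\mathbb {E}[X]\) to \(\mathbb {E}[Y]\) changes the expected reward by exactly \(|\mathbb {E}[X] - \mathbb {E}[Y]|\prod _{j\ne i}(1-p_j)\), so \(U_1\) and \(U_2\) are finite whenever every \(p_j \lt 1\). A small numerical check, with hypothetical means:

```python
# Check (under the illustrative max-of-Bernoullis model) that
# |E f(h(d,i,X)) - E f(h(d,i,Y))| = |EX - EY| * prod_{j != i} (1 - p_j),
# so both directions of the continuity bounds hold with finite U.
def e_max_with_replacement(ps, i, p_new):
    """E[max] after replacing position i (0-based here) with mean p_new."""
    prod = 1.0
    for j, p in enumerate(ps):
        if j != i:
            prod *= 1.0 - p
    return 1.0 - (1.0 - p_new) * prod

ps = [0.3, 0.6, 0.2]           # means of the fixed arms in d
px, py = 0.9, 0.4              # means of X and Y
lhs = abs(e_max_with_replacement(ps, 1, px) - e_max_with_replacement(ps, 1, py))
scale = (1 - 0.3) * (1 - 0.2)  # prod over j != 1 of (1 - p_j)
assert abs(lhs - abs(px - py) * scale) < 1e-12
```

Here one can take \(U_1 = 1\) and \(U_2 = 1/\prod _{j\ne i}(1-p_j)\), consistent with footnote 2's restriction on the Bernoulli case.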

Corollary 2.7.

Combining Corollary 2.6 and Assumption 5 and defining \(U = \max (U_1, U_2)\), we have (15) \[\begin{align} \frac{1}{U}\big |\mathbb {E}[X] - \mathbb {E}[Y]\big | &\le \big |\left(\mathbb {E}\left[f(h\left({\bf d}, i, X\right))\right] - \mathbb {E}\left[f(h\left({\bf d}, i, Y\right))\right]\right)\big | \end{align}\] (16) \[\begin{align} &\le U \big |\mathbb {E}[X] - \mathbb {E}[Y]\big |, \end{align}\] for any given random vector \({\bf d}\) and any \(i \in \lbrace 1, \ldots , K\rbrace\).

We note that linear bandits become a special case of the assumptions we considered for \(U = \sqrt {K}\).


3 PROPOSED ALGORITHM

The proposed algorithm, called CMAB-SM, is an explore-then-exploit strategy which aims to minimize the expected regret, be computationally efficient, and have a storage complexity that is linear in \(N\) and independent of \(K\). CMAB-SM, described in Algorithm 1, exploits the fact that choosing \(K\) arms from a set of \(K+1\) arms yields only \(K+1\) actions, making that sub-problem solvable using the standard MAB approach. In particular, if \(N = K+1\), there are \(\binom{K+1}{K} = K+1\) actions, and only these \(K+1\) actions need to be optimized. The intuition behind this approach closely follows the merge sort algorithm, applied to an array whose elements are noisy; to obtain better estimates of the array elements, we sample them repeatedly.

We construct a group \(G\), which is a vector of length \(K+1\) consisting of arm indices, or \(G\in [N]^{K+1}\). Then, we can construct \(K+1\) actions, each action using all but one of the entries in the group \(G\).

Let \({\bf a}^G_{-i}\) be an action in group \(G\) with \(G(i)^{th}\) arm left out, where \(G(i), i\in \lbrace 1, \ldots , K+1\rbrace\), is the \(i^{th}\) entry of the group.

With \(X_{G(i)}\) denoting the reward of arm \(G(i)\), the individual reward vector \({\bf d}_{{\bf a}^G_{-i}}\) with the action \({\bf a}^G_{-i}\) is (17) \[\begin{align} {\bf d}_{{\bf a}^G_{-i}} = \big (X_{G(1)}, \dots , X_{G(i-1)}, X_{G(i+1)}, \dots ,X_{G(K+1)}\big). \end{align}\]

The (random) reward obtained at any time with this action is \(r_{{\bf a}^G_{-i}} = f({\bf d}_{{\bf a}^G_{-i}})\), with a mean reward of \(\mu _{{\bf a}^G_{-i}} = \mathbb {E}[f({\bf d}_{{\bf a}^G_{-i}})]\). The next result shows that an ordering on \(\binom{K+1}{K}\) actions made using \(K+1\) arms gives an ordering on \(K+1\) arms under the considered assumptions.

Lemma 3.1.

An ordering on \(\binom{K+1}{K}\) actions, in group \(G\), made using \(K+1\) arms gives an ordering on \(K+1\) arms. In other words, if an ordering exists between actions \({\bf a}^G_{-i}\) and \({\bf a}^G_{-j}\), then an ordering exists between arms \(G(i)\) and \(G(j)\). More precisely, (18) \[\begin{eqnarray} \mu _{{\bf a}^G_{-i}} \gt \mu _{{\bf a}^G_{-j}}\Rightarrow \mathbb {E}\left[X_{G(i)}\right] \lt \mathbb {E}\left[X_{G(j)}\right] \end{eqnarray}\]

Proof.

(Outline): If we have two actions made from group \(G\) by excluding arm \(G(i)\) and arm \(G(j)\) respectively, then the action retaining the arm with the higher reward has the higher joint reward, as \(f\) is an increasing function. The detailed proof is provided in Appendix B.□
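Lemma 3.1 can be illustrated numerically for the max-of-Bernoullis case (an assumption used only for illustration): the mean of a leave-one-out action decreases as the mean of the left-out arm increases, so sorting the actions inverts the order of the arms.

```python
# Sketch of Lemma 3.1 for max-of-Bernoullis (illustrative assumption):
# mu of the action leaving out arm i is 1 - prod_{j != i} (1 - p_j),
# so leaving out a better arm yields a worse action.
def mu_leave_one_out(ps, i):
    prod = 1.0
    for j, p in enumerate(ps):
        if j != i:
            prod *= 1.0 - p
    return 1.0 - prod

ps = [0.1, 0.4, 0.7, 0.9]                     # a group of K + 1 = 4 arms
mus = [mu_leave_one_out(ps, i) for i in range(4)]
# arm means increase with i, so action means must decrease with i
assert all(mus[i] > mus[i + 1] for i in range(3))
```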

CMAB-SM divides all \(N\) arms arbitrarily into groups of \(K+1\) arms. Each group now contains only \(K+1\) actions. If the last group contains fewer than \(K+1\) arms (i.e., if \(N\mod {(}K+1) \gt 0\)), arms from other groups are added (repeated) so that the last group also has \(K+1\) arms. CMAB-SM then picks the first group of \(K+1\) arms and orders the arms in the group using the SORT subroutine, so that the \(K+1\) arms are ordered with respect to their expected individual rewards. We take \(G^*\), the best \(K\) arms seen so far, to be the top \(K\) arms in \(G_1\). The algorithm then proceeds in rounds \(k \in \lbrace 2, \ldots , \frac{N}{K+1}\rbrace\).3 In the \(k^{th}\) round it performs SORT on \(G_k\) and merges \(G_k\) and \(G^*\) using the MERGE subroutine to obtain a new \(G^*\). The SORT subroutine orders the \(K+1\) arms in \(G_k\). The MERGE subroutine takes the best \(K\) arms before this round, \(G^*\), and the best \(K\) arms from the SORT subroutine on \(G_k\), and merges them to find the best \(K\) arms seen so far, saving them as \(G^*\). This \(G^*\) is then passed to the next round \(k\) to merge with the remaining groups.

At the end of the \((\frac{N}{K+1})^{th}\) round, we will have played and merged all groups, resulting in an optimal action which maximizes the expected reward for the remaining time slots. Apart from the sort-and-merge scheme, we also use a hyperparameter \(\lambda\) in our algorithm, which denotes the minimum gap the agent can resolve between any two arms. We choose the value of the hyperparameter which minimizes the regret of the CMAB-SM algorithm; following the regret analysis, this value is specified in (29). If the gap between arm \(i\) and arm \(j\) is smaller than \(\lambda\), the algorithm cannot determine which arm is better with high probability, and it selects the arm with the higher sample mean as the better arm. This behaviour is common to both the SORT and MERGE subroutines. We now describe the SORT and MERGE subroutines used in CMAB-SM in detail.

3.1 SORT

The SORT subroutine is given in Algorithm 2. In this subroutine, we play the \(K+1\) actions formed from the \(K+1\) arms in a group \(G\), each action corresponding to one left-out arm. The subroutine proceeds in rounds, similar to the UCB algorithm of [6]. By the end of round \(r\), each action has been played \(n_r\) times, so that the expected reward of each action can be estimated within \(\pm \Delta _r\). We will show in Lemma 4.1 that the value of \(n_r\) is chosen to obtain high-probability confidence intervals of radius \(\Delta _r\) using Hoeffding’s Inequality. At the end of each round, the estimates are used to sort the arms, where arms \(G(i), G(j)\) are considered sorted when the upper bound on the reward estimate of action \({\bf a}^G_{-i}\) is less than the lower bound of action \({\bf a}^G_{-j}\). When an arm is placed at its true sorted location in the group, its corresponding action is not sampled again. The procedure ends when \(\Delta _r\lt \lambda\) or when all \(K+1\) arms are sorted. At the end of the algorithm, only the top \(K\) arms are provided as output.
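The Hoeffding calculation behind \(n_r\) can be sketched as follows; the constants here come from the standard Hoeffding bound for \([0,1]\)-valued rewards and may differ from the exact choice made in Lemma 4.1.

```python
import math

# Hedged sketch of the Hoeffding-based sample count: to estimate a
# [0,1]-bounded action reward within +/- delta_r with failure probability
# at most p_fail, it suffices to take
#     n_r = ceil( log(2 / p_fail) / (2 * delta_r^2) )
# samples, since P(|mean - mu| >= delta) <= 2 * exp(-2 * n * delta^2).
def samples_needed(delta_r, p_fail):
    return math.ceil(math.log(2.0 / p_fail) / (2.0 * delta_r ** 2))

n = samples_needed(0.1, 0.05)
assert 2 * math.exp(-2 * n * 0.1 ** 2) <= 0.05   # the Hoeffding bound holds
```

Halving the confidence radius \(\Delta _r\) quadruples the required samples, which is why the subroutine stops refining once \(\Delta _r \lt \lambda\).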

3.2 MERGE

The MERGE subroutine is given in Algorithm 3. It merges two groups of \(K\) sorted arms each to produce the sorted best \(K\) arms. Since we only need the best \(K\) of the merged \(2K\) arms to be sorted, this can be done with only \(K+1\) arm comparisons.

Starting with two \(K\)-sized sorted groups \(G_1\) and \(G_2\), and an optimal group that is empty at the start of the subroutine, we identify the best \(K\) out of the \(2K\) arms by finding the best arms one by one. Starting with both \(i\) and \(j\) equal to 1, we construct a new action by replacing the \(i^{th}\) arm of group \(G_1\) with the \(j^{th}\) arm of group \(G_2\). If the reward after replacement is larger, the \(j^{th}\) arm of \(G_2\) is the next arm in the sorted final list; otherwise, the \(i^{th}\) arm of \(G_1\) is the next arm in the sorted final list. To differentiate between the two actions, a procedure similar to the SORT subroutine is used: we sample the actions for \(n_r\) time steps, where we will show in Lemma 4.2 that the value of \(n_r\) is chosen to obtain high-probability confidence intervals of radius \(\Delta _r\) using Hoeffding’s Inequality. Based on whether the \(i^{th}\) arm of \(G_1\) or the \(j^{th}\) arm of \(G_2\) made it into the optimal set, \(i\) or \(j\) is incremented, and the procedure is repeated until the \(K\) best arms in the merger of the two groups are obtained.
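The pointer-based merge above can be sketched as follows (ours, not the paper's Algorithm 3; `play_action` and the sampling constants are as in the SORT sketch). Each comparison pits the baseline action containing \(G_1[i]\) against the same action with \(G_2[j]\) swapped in.

```python
import math

def estimate(action, play_action, n):
    """Sample mean of an action's reward over n plays."""
    return sum(play_action(action) for _ in range(n)) / n

def merge_groups(G1, G2, play_action, lam, U=1.0, N=12, T=100):
    """Merge two sorted K-arm groups into the best K arms, one comparison at a time."""
    K = len(G1)
    best, i, j = [], 0, 0
    while len(best) < K:
        if i == K:                            # G1 exhausted: take the rest from G2
            best.append(G2[j]); j += 1
            continue
        if j == K:                            # G2 exhausted: take the rest from G1
            best.append(G1[i]); i += 1
            continue
        old = G1[:]                           # baseline action containing G1[i]
        new = G1[:i] + [G2[j]] + G1[i + 1:]   # same action with G2[j] swapped in
        delta, mu_old, mu_new = U, 0.0, 0.0
        while delta >= lam:                   # round-based sampling, as in SORT
            delta /= 2
            n_r = math.ceil(64 * U**2 * math.log(2 * N * T) / delta**2)
            mu_old = estimate(old, play_action, n_r)
            mu_new = estimate(new, play_action, n_r)
            if abs(mu_new - mu_old) > 2 * delta:
                break                         # intervals separated
        if mu_new > mu_old:                   # swapped-in arm wins: it is next best
            best.append(G2[j]); j += 1
        else:
            best.append(G1[i]); i += 1
    return best
```

Note the inequality direction: here a *higher* reward for the modified action favors the new arm, whereas in SORT a *lower* leave-one-out reward favors the left-out arm.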

3.3 Complexity of CMAB-SM

We now analyze the complexity of CMAB-SM for both storage and computation at each time step. Detailed subroutines are provided in Appendix G, with the key pseudo-codes in Algorithm 13.

While running the SORT or MERGE subroutine, the algorithm stores the reward estimate of each action in the group and sorts these actions. The total storage at any step is no more than \(O(K)\). Even when groups are being merged, \(O(K)\) temporary storage is used for the merged rewards. This merged group is then used to decide the action in the exploitation phase. Thus, the maximum storage at any time is \(O(K)\) for the subroutines and \(O(N)\) for CMAB-SM overall. To evaluate the computational complexity at each time step, we consider the three cases of what the algorithm may be doing at a given time step.

(1) In the SORT subroutine, at the end of each iteration of the while loop, the arms in the group are sorted, requiring \(O(K\log K)\) computations. It then loops over all the arms to place them in the correct order, which takes \(O(K)\) steps. Thus, the worst-case per-time-step computational complexity of SORT is \(O(K\log K)\). (2) The MERGE subroutine at any time step either plays an action and saves the result, or performs comparisons, all of which take \(O(1)\) time. (3) After MERGE is complete, the best action is available and is then exploited, making the complexity of the exploitation phase \(O(1)\). Thus, the overall complexity at any time step is \(O(K\log K)\), incurred when sorting the actions for the removal of sub-optimal arms after every round of the SORT subroutine.

In each call to SORT, the actions are sorted with respect to their mean observed rewards. From Lemma 3.1, an ordering is also obtained for the corresponding arms. In the MERGE subroutine, a new action is constructed from an old action by replacing exactly one arm. The ordering between the old action and the new action gives the ordering between the replaced arm and the new arm. Note that the inequality conditions work in different directions for SORT and MERGE algorithms.

3.4 Other Design Options

We note that an alternative algorithm can be constructed by keeping the first \(K-1\) arms fixed. The algorithm would then select the best arm from the remaining \(N-K+1\) arms, which would always be kept among the first \(K-1\) arms, and the process would be repeated until all \(K\) places are filled. However, this algorithm has two issues. (1) Higher complexity: the algorithm needs to sort \(N-K+1\) arms into place, which increases the sorting complexity from \(\tilde{O}(K)\) to \(\tilde{O}(N)\). (2) More exploration steps: this algorithm performs exploration after placing every arm among the first \(K\), which increases the time required for exploration by a factor of \(K\) and would increase the order of the regret bound. This makes CMAB-SM a better choice compared to a naïve implementation of UCB with \(K-1\) arms fixed.


4 REGRET ANALYSIS OF THE PROPOSED ALGORITHM

In this section, we will present the main result of the paper, related to the regret analysis of the proposed algorithm.

4.1 Bounds on Exploitation Regret and Exploration Time

In this subsection, we will bound the regret in the exploitation phase, which indicates the loss in reward due to choosing an incorrect action at the end of the MERGE algorithm. We will also bound the time spent in the SORT and the MERGE subroutines, which is the exploration phase.

We first bound the time taken by the SORT and MERGE subroutines.

Lemma 4.1 (Sort Time Requirement).

The SORT subroutine (Algorithm 2) gives the correct ordering of the \(K+1\) actions in a group \(G\) with probability \(1-\frac{K}{2N^2T^2}\), up to the precision \(\lambda\) defined in equation (19), where the actions are chosen for at most \[\begin{equation*} \left(\sum _{i=1}^{K+1} \frac{64 U^2 \log 2NT}{\max (\delta _{G(i)}^2, \lambda ^2)}\right) \end{equation*}\] time steps, where \(\delta _{G(i)}\) is given in (19).

Proof.

(Outline) We sample each action until the confidence intervals around the estimated means of any two actions are separated. The confidence intervals shrink as the number of samples increases, by the concentration bounds of Lemma A.1, with a lower limit of \(\lambda\). Using Corollary 2.7 and Hoeffding’s Inequality (Lemma A.1), we bound the number of samples required for separation with high confidence. We then use union bounds to bound the total number of samples required across all actions. The detailed proof is provided in Appendix C.□
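The per-action sample count implied by the lemma can be evaluated directly (a small helper of ours; \(U\), \(N\), \(T\) as in the lemma):

```python
import math

def samples_needed(delta, lam, U, N, T):
    """Lemma 4.1's per-action sample bound: 64 U^2 log(2NT) / max(delta, lam)^2."""
    return math.ceil(64 * U**2 * math.log(2 * N * T) / max(delta, lam)**2)

def sort_time_bound(gaps, lam, U, N, T):
    """Upper bound on time steps to SORT one group with action gaps `gaps`."""
    return sum(samples_needed(d, lam, U, N, T) for d in gaps)
```

Note the role of \(\lambda\): once a gap drops below \(\lambda\), the sample count stops growing, which is what caps the exploration time.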

Lemma 4.2 (Merge Time Requirement).

The MERGE subroutine (Algorithm 3) merges the arms in two groups \(G_1\) and \(G_2\) into \(G\) correctly with probability \(1-\frac{K}{2N^2T^2}\), up to the precision \(\lambda\) defined in equation (19), where the total number of time steps needed to merge is at most \[\begin{eqnarray*} \left(\sum _{i=1}^{K+1} \frac{64 U^2 \log 2NT}{\max (\delta _{G(i)}^2, \lambda ^2)} \right), \end{eqnarray*}\]

where \(\delta _{G(i)}\) is given in (19).

Proof.

(Outline) While merging two groups \(G_1\) and \(G_2\), we sample the two actions until their estimated means are separated by twice the confidence interval radius around the estimates. Again using Lemma A.1 and Corollary 2.7, we bound the number of samples required to reduce the confidence intervals sufficiently to order two arms. The detailed proof is provided in Appendix C.□

In order to bound the regret in the exploitation phase, we first characterize the probability that the action decided by CMAB-SM is not the best action.

Lemma 4.3 (Total Error Probability).

The probability that the action selected by CMAB-SM during the exploitation phase is not the best action (up to precision defined in Equation 19) is at most \(\frac{1}{NT^2}\).

Proof.

(Outline) We use union bounds to calculate the total error probability of the algorithm. We use Lemmas 4.1 and 4.2 to calculate the failure probability of each SORT and MERGE event, respectively. There are \(\frac{N}{K+1}\) groups in total to be sorted, and \(\frac{N}{K+1}-1\) merges to be performed. Taking the union bound over the total number of failure events, with the probability of each failure event, gives an upper bound on the total error probability of the algorithm. The detailed proof is provided in Appendix D.□

In the next result, we bound the time spent in exploring, including the SORT and MERGE subroutines for all groups.

Lemma 4.4 (Bound on Exploration Time Steps).

The total number of time steps used to SORT all \(\frac{N}{K+1}\) groups and to merge these sorted groups one after the other is bounded as (21) \[\begin{eqnarray} T_{exp} \le \frac{128NU^2\log {2NT}}{\lambda ^2} \end{eqnarray}\]

Proof.

(Outline) Lemma 4.1 gives the maximum number of samples required to sort one group, and Lemma 4.2 gives the number of samples required to merge two groups. Since there are \(\frac{N}{K+1}\) groups to be sorted and \(\frac{N}{K+1}-1\) merges to be performed, summing the samples over all groups gives an upper bound on the total samples required for exploration. The detailed proof is provided in Appendix E.□
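For concreteness, the bound (21) can be evaluated for the \(\lambda\) chosen in the regret analysis (a numeric sketch of ours using the paper's constants); the exploration share of the horizon then shrinks like \(T^{-1/3}\) up to log factors.

```python
import math

def exploration_bound(N, T, U=1.0):
    """Lemma 4.4's bound on T_exp, with lambda set as in Eq. (30)."""
    lam = (256 * N * math.log(2 * N * T) / T) ** (1 / 3)
    return 128 * N * U**2 * math.log(2 * N * T) / lam**2

T = 10**6
share = exploration_bound(N=24, T=T) / T   # fraction of the horizon spent exploring
```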

In the following result, we bound the expected regret in the exploitation phase, caused by CMAB-SM selecting an incorrect action.

Lemma 4.5 (Bounded Exploitation Regret).

The expected regret when a sub-optimal action \({\bf \hat{a}}^*\) is returned by CMAB-SM is bounded as (22) \[\begin{eqnarray} \mathbb {E}\left[\Delta _{{\bf \hat{a}}^*}\right] \le U\lambda \sqrt {K} + \frac{U\sqrt {K}}{NT^2} \end{eqnarray}\]

Proof.

(Outline) Regret can arise in the exploitation phase when either the SORT or the MERGE algorithm had a failure event. Regret can also arise in the exploitation phase if two arms have expected rewards close enough that neither SORT nor MERGE can distinguish between them with high confidence. Combining these two sources of regret and using Assumption 4 gives the upper bound on the expected regret during the exploitation phase. The detailed proof is provided in Appendix F.□

4.2 Main Result

Our main result is presented in Theorem 4.6, which states that the CMAB-SM algorithm achieves a sub-linear expected regret.

Theorem 4.6.

The CMAB-SM algorithm described in Algorithm 1 has its expected regret, accumulated over the entire time horizon, upper bounded as (23) \[\begin{equation} W(T) = \tilde{O}\left(N^\frac{1}{3}K^\frac{1}{2}T^\frac{2}{3}\right) \end{equation}\]

Proof.

(Outline) We first note that regret from playing a sub-optimal action can arise during the exploration phase, or during the exploitation phase if exploration resulted in a sub-optimal action. The time CMAB-SM spends in exploration is the total time spent in the SORT and MERGE subroutines, which we bound by \(\frac{128NU^2\log {2NT}}{\lambda ^2}\) using Lemma 4.4. Further, if exploration results in a sub-optimal action, the expected regret in a single time step of the exploitation phase is \((U\lambda \sqrt {K} + \frac{U\sqrt {K}}{NT^2})\) by Lemma 4.5. Choosing the optimal value of \(\lambda\) as defined in (30), we obtain the required bound. Having described the outline, we next give the detailed steps of the proof.

(Detailed Steps) We note that the expected regret till time \(T\) is the sum of the expected regret accumulated at each round. We can rewrite it as a sum over the two phases of the algorithm, exploration and exploitation, as, (24) \[\begin{eqnarray} W(T)&=& \sum ^{T}_{t = 1}\mathbb {E}\left[R(t)\right] \end{eqnarray}\] (25) \[\begin{eqnarray} &=& \sum ^{T_{exp}}_{t = 1}\mathbb {E}\left[R(t)\right] + \left(T-T_{exp}\right)\times \mathbb {E}\left[\Delta _{\hat{{\bf a}}^*}\right] \end{eqnarray}\] (26) \[\begin{eqnarray} &\le & \sum ^{T_{exp}}_{t = 1}\mathbb {E}\left[R(t)\right] + T\mathbb {E}\left[\Delta _{\hat{{\bf a}}^*}\right] \end{eqnarray}\] (27) \[\begin{eqnarray} &\le & \sum ^{T_{exp}}_{t = 1}\max \left(R(t)\right) + T\mathbb {E}\left[\Delta _{\hat{{\bf a}}^*}\right] \end{eqnarray}\] (28) \[\begin{eqnarray} &\le & T_{exp}\max \left(R(t)\right) + T\mathbb {E}\left[\Delta _{\hat{{\bf a}}^*}\right], \end{eqnarray}\] where (25) follows from splitting the regret into the exploration and exploitation phases, (26) follows since \(T-T_{exp}\le T\), (27) follows since the expectation is at most the maximum, and (28) follows by bounding each summand by its maximum.

Using the maximum per-round regret from inequality (70), the maximum exploration time from Lemma 4.4, and the maximum exploitation regret from Lemma 4.5, we have, (29) \[\begin{eqnarray} &&W(T)\nonumber \nonumber\\ &\le & U\sqrt {K}\frac{128NU^2\log {2NT}}{\lambda ^2} + T\left(U\lambda \sqrt {K} + \frac{U\sqrt {K}}{NT^2}\right) \nonumber \nonumber\\ &=& \left(\frac{128NU^3\sqrt {K}\log {2NT}}{\lambda ^2} + TU\lambda \sqrt {K}\right) + \frac{U\sqrt {K}}{NT}. \end{eqnarray}\] We now choose a value of \(\lambda\) to optimize \(W(T)\). Since \(U\) is typically unknown when the algorithm is implemented, we use the following \(U\)-independent value of \(\lambda\): (30) \[\begin{equation} \lambda =\left(\frac{256N\log {2NT}}{T}\right)^\frac{1}{3} \end{equation}\] Choosing the value of \(\lambda\) as defined in (30), we have the total regret of the algorithm as \[\begin{eqnarray*} W(T)&\le & (U^3+2U)\sqrt {K}\left(32N\log {2NT}\right)^{\frac{1}{3}}T^{\frac{2}{3}} + \frac{U\sqrt {K}}{NT} \end{eqnarray*}\] This proves the result stated in the Theorem.□
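The final substitution can be checked numerically (a sanity check of ours): plugging the \(\lambda\) of (30) into the first two terms of (29) reproduces the closed form \((U^3+2U)\sqrt{K}(32N\log 2NT)^{1/3}T^{2/3}\) exactly.

```python
import math

def bound_29(N, K, T, U, lam):
    """First two terms of (29); the vanishing U*sqrt(K)/(N*T) term is dropped."""
    L = math.log(2 * N * T)
    return 128 * N * U**3 * math.sqrt(K) * L / lam**2 + T * U * lam * math.sqrt(K)

def closed_form(N, K, T, U):
    """The closed-form bound stated at the end of the proof, without the O(1/NT) term."""
    L = math.log(2 * N * T)
    return (U**3 + 2 * U) * math.sqrt(K) * (32 * N * L) ** (1 / 3) * T ** (2 / 3)
```

The agreement is algebraic, not approximate: \(128/256^{2/3} = 32^{1/3}\) and \(256^{1/3} = 2\cdot 32^{1/3}\).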

This trick, where we tune \(\lambda\) after defining the precision of each SORT/MERGE round, allows us to eliminate the dependence on potentially hard-to-order sequences of arms. The intuition is that, over a finite time horizon, an agent wants to find the best possible arm it can, but can only do so up to a certain precision permitted by the finite time available.

4.3 Handling Insufficient Exploration Time

There can be instances where the algorithm is run with insufficient time for exploration. We first characterize the values of \(T\) for which this occurs, and then evaluate the regret in such a scenario.

Note that we assumed in Section 2 that the reward of each arm lies in \([0,1]\). As a result, the gap between any two arms is at most 1. For the optimal \(\lambda\) defined in Equation (19), any \(T \le 256N\log (2NT)\) makes \(\lambda \ge 1\), which serves no practical purpose under our assumption. In that case, we arbitrarily select one of the \({{N}\choose {K}}\) actions and still suffer a maximum regret of \(TU\sqrt {K} \le \lambda TU\sqrt {K}\). We have, (31) \[\begin{align} W(T) &\le U\sqrt {K}T \end{align}\] (32) \[\begin{align} &\le TU\sqrt {K}\lambda \end{align}\] (33) \[\begin{align} &= U\sqrt {K}(256N\log (2NT))^{1/3}T^{2/3} \end{align}\] We note that the sub-linear bound in Equation (33) grows as \(\tilde{O}(T^{2/3})\) but with a large multiplicative constant. Hence, the linear bound in Equation (31) provides a better bound in this regime, because of the limited time available to explore all arms.

4.4 Handling Unknown Time Horizon Using Doubling Trick

We now analyze the case where the time horizon \(T\) is unknown and the algorithm must optimize its actions, and in particular tune \(\lambda\), without knowledge of \(T\). We use the standard doubling trick from the Multi-Armed Bandit literature [6, 8]. To use the doubling trick, we start the algorithm from \(T_0 = 0\) and restart it after every \(T_l = 2^l,\ l=1,2, \ldots\) time steps, until the algorithm reaches the unknown \(T\). Each restart runs for \(T_l - T_{l-1}\) steps with \(\lambda _l =(\frac{256 N\log 2N(T_l - T_{l-1})}{T_l - T_{l-1}})^{1/3}\).
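The restart schedule can be sketched as follows (an illustrative helper of ours; segment \(l\) covers steps \((T_{l-1}, T_l]\) and re-tunes \(\lambda_l\) from the segment length):

```python
import math

def doubling_schedule(T, N):
    """Restart boundaries and per-segment lambda for the doubling trick."""
    segments = []
    prev, l = 0, 1
    while prev < T:
        end = min(2**l, T)                   # T_l = 2^l, truncated at the horizon
        length = end - prev                  # this restart runs for T_l - T_{l-1} steps
        lam = (256 * N * math.log(2 * N * length) / length) ** (1 / 3)
        segments.append((prev, end, lam))
        prev, l = end, l + 1
    return segments
```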

To show that the regret is bounded by \(T^{2/3}\) for the doubling algorithm, we use Theorem 4 from [8] which we state in the following lemma.

Lemma 4.7.

If an algorithm \(\mathcal {A}\) satisfies \(R_T(\mathcal {A}_T) \le cT^\gamma (\log T)^\delta + f(T)\) for \(0\lt \gamma \lt 1\), \(\delta \ge 0\), \(c\gt 0\), and an increasing function \(f(t) = o(t^\gamma (\log t)^\delta)\) as \(t\rightarrow \infty\), then the anytime version \(\mathcal {A}^{\prime } := \mathcal {DT}(\mathcal {A}, (T_i)_{i\in \mathbb {N}})\) with a geometric sequence \((T_i)_{i\in \mathbb {N}}\) of parameters \(T_0\in \mathbb {N}^*\) and \(b\gt 1\) (i.e., \(T_i = \lfloor T_0b^i\rfloor\)), with the condition \(T_0(b-1) \gt 1\) if \(\delta \gt 0\), satisfies (34) \[\begin{align} R_T(\mathcal {A}^{\prime }) \le l(\gamma , \delta , T_0, b)cT^\gamma (\log T)^\delta + g(T), \end{align}\] with an increasing function \(g(t) = o(t^\gamma (\log t)^\delta)\) and a constant loss \(l(\gamma , \delta , T_0, b)\gt 1\), (35) \[\begin{align} l(\gamma , \delta , T_0, b) := \left(\left(\frac{\log (T_0(b-1)+1)}{\log (T_0(b-1))}\right)^\delta \right)\times \frac{b^\gamma (b-1)^\gamma }{b^\gamma -1} \end{align}\]

Using Lemma 4.7 for \(b = 2, \gamma = 2/3, \delta = 1/3\), we can convert our algorithm to an anytime algorithm.
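The constant loss of Lemma 4.7 can be evaluated for these parameters (a numeric illustration of ours; \(T_0=2\) is an illustrative choice satisfying the condition \(T_0(b-1)\gt 1\) required when \(\delta \gt 0\)):

```python
import math

def doubling_loss(gamma, delta, T0, b):
    """Eq. (35): the constant-factor loss of the doubling-trick conversion."""
    assert T0 * (b - 1) > 1, "condition of Lemma 4.7 when delta > 0"
    head = (math.log(T0 * (b - 1) + 1) / math.log(T0 * (b - 1))) ** delta
    return head * (b**gamma * (b - 1) ** gamma) / (b**gamma - 1)

loss = doubling_loss(gamma=2 / 3, delta=1 / 3, T0=2, b=2)
```

For \(b=2\), \(\gamma=2/3\), \(\delta=1/3\), the loss is a small constant (roughly 3.15), so the anytime conversion preserves the \(\tilde{O}(T^{2/3})\) rate.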

4.5 Comparison between CMAB-SM and UCB Algorithm

We now compare the regret bound with the one that would be achieved by using the UCB approach on each of the \(\binom{N}{K}\) actions.

Lemma 4.8 (UCB Regret, [6]).

For a Multi-Armed Bandit setting with action space \(\mathcal {A}\), time horizon \(T\), and precision \(\lambda \approx \sqrt {\frac{|\mathcal {A}|\log {|\mathcal {A}|}}{T}}\), the expected regret accumulated during the entire time horizon \(T\) by the improved UCB algorithm is upper bounded as \[\begin{equation*} W(T) \le \sqrt {|\mathcal {A}|T}\frac{\log {\left(|\mathcal {A}|\log {|\mathcal {A}|}\right)}}{\sqrt {\log {|\mathcal {A}|}}} \end{equation*}\]

Bounding the size of the action space using Stirling’s approximation [38], we get the expected regret accumulated by the UCB algorithm at time \(T\) as \[\begin{eqnarray*} W_{UCB}(T) = \tilde{O}\left(\left(\frac{eN}{K}\right)^{\frac{K}{2}}T^{\frac{1}{2}}\right) \end{eqnarray*}\]

For the UCB approach to outperform CMAB-SM, \(T\) has to be very large. More formally, \[\begin{eqnarray*} W(T) \gt W_{UCB}(T), \text{ when }T = \tilde{\Omega }\left(\frac{e^{3K}N^{3K-2}}{K^{3K+3}}\right) \end{eqnarray*}\]

Even for an agent which can play \(10^{12}\) actions per second, this will take about 10 million years to outperform CMAB-SM for a setup with \(N=30\) and \(K=15\). Hence, for all practical problems when an agent has a large number of arms to play simultaneously, the CMAB-SM algorithm will outperform the UCB algorithm.
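This back-of-the-envelope estimate can be reproduced as follows (ours; we evaluate the crossover expression in log-space to avoid overflow, ignoring the constants and log factors hidden by the \(\tilde{\Omega}\)):

```python
import math

def crossover_T(N, K):
    """e^{3K} N^{3K-2} / K^{3K+3}, the (log-factor-free) crossover horizon."""
    log_T = 3 * K + (3 * K - 2) * math.log(N) - (3 * K + 3) * math.log(K)
    return math.exp(log_T)

# At 10^12 actions per second, convert the crossover horizon into years
years = crossover_T(30, 15) / 1e12 / (3600 * 24 * 365)
```

For \(N=30\), \(K=15\), this comes out on the order of \(10^7\) years, consistent with the "about 10 million years" figure above.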

4.6 Discussion of \(\tilde{O}(T^{2/3})\) Regret Bound

A lower bound of \(\Omega (\sqrt {NT})\) for linear bandits was proven in [12], and a lower bound of \(\Omega (\sqrt {KNT})\) for semi-bandits in [28]. Note that any algorithm with an \(\tilde{O}(\sqrt {T})\) regret bound compares all actions with the best possible action, either by obtaining individual rewards (semi-bandits) or by estimating individual rewards (linear bandits) [1, 12, 28].

An individual sub-optimal action \({\bf a}\) is eliminated in \(O(\frac{1}{\Delta _{\bf a}^2})\) samples. The regret from this action then becomes \(\Delta _{\bf a}\times \frac{1}{\Delta _{\bf a}^2} = \frac{1}{\Delta _{\bf a}}\), and hence the cumulative regret takes the form \(\frac{1}{\lambda } + \lambda T\). This gives \(O(\sqrt {T})\) regret for \(\lambda = \frac{1}{\sqrt {T}}\).

Due to the bandit feedback, we do not directly observe the rewards of individual arms; we only obtain an ordering between any two arms that can be compared. To eliminate arms early, we would need an estimator of individual arm rewards, as in linear bandits. Moreover, the regret accumulated while eliminating arm \(i\) is not of the form \(O(1/\Delta _i)\). This hinders the development of an \(\tilde{O}(\sqrt {T})\) bound with linear space and time complexity.


5 NUMERICAL EVALUATION

In this section, we evaluate CMAB-SM in a synthetic problem setting. Since this paper provides the first result for the CMAB problem with non-linear reward functions and bandit feedback, we compare against the improved UCB algorithm described in [6], which is optimal for small \(N\) and \(K\) but whose regret scales with \(\binom{N}{K}\).

For evaluations, we ran each algorithm for \(T = 10^6\) time steps and averaged over 30 runs. We compare the cumulative regret at each \(t\), defined as \[\begin{eqnarray*} W(t) = \sum ^t_{t^{\prime }=1}R(t^{\prime }) \end{eqnarray*}\]

We consider two values of \(N \in \lbrace 12, 24\rbrace\). For \(N=12\), we choose \(K \in \lbrace 2, 3, 5\rbrace\), while for \(N=24\), we choose \(K \in \lbrace 2, 3, 5, 7, 11\rbrace\). Since the arms must have FSD over each other, we describe one example of a single-parameter reward distribution with this property. We consider a random variable \(Y_i\) which follows an exponential distribution with parameter \(\lambda _i\), \(Y_i \sim exp(\lambda _i)\). Since this random variable takes values in \([0,\infty)\), we transform it using the \(\arctan\) function to limit it to \([0,1)\) as (36) \[\begin{eqnarray} X_i &=& \frac{2}{\pi }\arctan (Y_i). \end{eqnarray}\]

We note that arm \(i\) has FSD over arm \(j\) if \(\lambda _i\gt \lambda _j\). Thus, this reward distribution satisfies Assumption 2 as long as no two arms have the same parameter, i.e., \(\lambda _i\ne \lambda _j\) for all \(i\ne j\). Figure 3(b) in the Appendix plots \(P(X\ge x)\) of the reward distribution for different values of \(\lambda _i\). We see that \(P(X\ge x)\) is larger for the distribution with the larger value of \(\lambda\), and for any two different values of \(\lambda\) there exists an \(x\) (e.g., any \(x\in (0,1]\)) at which the values of \(P(X\ge x)\) differ, showing that these reward distributions satisfy Assumption 2.
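Sampling from the transformed distribution (36) is straightforward; the sketch below (ours) reads \(\lambda_i\) as the mean (scale) of the exponential, an assumption consistent with the FSD direction stated above, and checks the dominance empirically.

```python
import math
import random

def sample_reward(lam, rng):
    """One draw of X = (2/pi) arctan(Y), Y ~ Exp with mean lam; lies in [0, 1)."""
    y = rng.expovariate(1.0 / lam)           # expovariate takes the rate = 1/mean
    return (2 / math.pi) * math.atan(y)

rng = random.Random(0)
big = [sample_reward(2.0, rng) for _ in range(20000)]
small = [sample_reward(0.5, rng) for _ in range(20000)]
```

Empirically, \(P(X\ge x)\) for \(\lambda=2\) exceeds that for \(\lambda=0.5\) at every \(x\in(0,1)\), matching Figure 3(b).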

We consider an online portal that can display only \(K\) products because of practical limitations. Assume that the reward from each arm, which indicates the profit from the sale of a product, follows the distribution defined in (36). However, there is an additional benefit when multiple products are sold together, e.g., reduced overhead/shipping costs. We define a non-linear reward \(r\) as a function \(f\) of the individual arms' rewards as follows: (37) \[\begin{equation} f\left(X_1, X_2,\ldots , X_K\right) = \frac{2}{K(K+1)}\sum _{i = 1}^K\sum _{j \ge i}^K X_i X_j. \end{equation}\] The expected value of the reward in terms of the expected values of the individual rewards is \[\begin{eqnarray*} \mathbb {E}\left[f\left(X_1, X_2,\ldots , X_K\right)\right]\nonumber \nonumber &=& \frac{2}{K(K+1)}\left(\sum _{i = 1}^K\mathbb {E}\left[X_i^2\right] + \sum _{i = 1}^K\sum _{j \gt i}^K\mathbb {E}\left[X_i\right]\mathbb {E}\left[X_j\right]\right), \end{eqnarray*}\] which follows from linearity of expectation and the independence of the arms' rewards. We note that the expected reward is strictly increasing with respect to the expected values of the individual rewards.

Figure 1 shows the evaluation results for the setting where the reward of an action is the function in Equation (37) of the individual arms' rewards. We see that for both values of \(N\), CMAB-SM outperforms UCB for \(K\gt 3\) in the time-step range considered. Further, for \(N=24\) and \(K=2\), the gap between the proposed algorithm and UCB is small. In summary, when \(\binom{N}{K}\) is moderately large and \(T\) is not extremely large (\(T\lt \tilde{O}(\frac{e^{3K}N^{3K-2}}{K^{3K+3}})\)), CMAB-SM outperforms the baseline. Further, the computation and storage complexity of the proposed algorithm are much better than those of the baseline, as seen in Section 3.3.
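The pairwise reward (37) can be computed directly (a small helper of ours); note the normalization: there are \(K(K+1)/2\) pairs with \(j\ge i\), so all-ones inputs give reward exactly 1.

```python
def pairwise_reward(xs):
    """Reward of Eq. (37): normalized sum of products X_i X_j over pairs j >= i."""
    K = len(xs)
    total = sum(xs[i] * xs[j] for i in range(K) for j in range(i, K))
    return 2 * total / (K * (K + 1))
```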

Fig. 1.

Fig. 1. Empirical regret of the UCB and CMAB-SM algorithms for the case when the reward of an action is a non-linear function of the rewards of the individual arms, for various values of \( N \) and \( K \). As can be seen from the plots, except for the case of \( K=2 \), CMAB-SM incurs significantly lower regret than the UCB algorithm.

In Appendix H, we further consider two more examples for Bernoulli distribution, where the sum and maximum of rewards are considered as the reward functions.


6 APPLICATION TO SOCIAL INFLUENCE MAXIMIZATION

We now apply CMAB-SM and evaluate it for solving the problem of Social Influence Maximization, which in the words of Domingos & Richardson [15] is defined as follows: “if we can try to convince a subset of individuals in a social network to adopt a new product or innovation, and the goal is to trigger a large cascade of further adoptions, which set of individuals should we target?”

6.1 Problem Description

Consider a social network \(G=(V,E,P)\), where \(V\) is the node set representing individuals, \(E=\lbrace (u,v): u,v \in V\rbrace\) is the set of directed (\(u\) to \(v\)) connections among the individuals, and \(P=\lbrace p_{e}\!\!: e \in E\rbrace\) is the set of strengths/weights of the connections in \(E\).

For the standard influence maximization problem [23], we select a ‘seed’ set \(S\) of \(K\gt 0\) individuals to initiate a cascade. Analogous to diseases spreading between individuals, the set of users \(S\) will attempt to influence their neighbors in the social network \(G\), those influenced neighbors will attempt to influence their neighbors, and so on. We consider the simple, discrete-time independent cascades diffusion model. In this model, if individual \(u\) becomes influenced, then at the next time step, individual \(u\) attempts to influence each of its neighbors \(v\) in the network with probability \(p_{(u,v)}\). All influence attempts are statistically independent of each other. Our goal in picking the seed set \(S\) is to trigger a large cascade. Let \(y^{(v)} = 1\) denote that node \(v\) was influenced in the cascade, and \(y^{(v)} = 0\) otherwise. Let \(\sigma (S)\) denote the expected cascade size when \(S\) is the seed set, (38) \[\begin{eqnarray} \sigma (S) &=& \mathbb {E}\left[\sum _{v \in V} y^{(v)} \bigg | \text{individuals in } S \text{ were initially influenced} \right]. \end{eqnarray}\] The goal is to find \(\arg \max _{S\subseteq V:\ |S|=K} \sigma (S)\).

Note that the word time-step in the above diffusion model is different from the word time-step in an adaptive/online algorithm. In fact, for every time-step in an adaptive algorithm, the entire diffusion (all time-steps of the aforementioned independent cascade model) takes place.
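The diffusion model above can be simulated in a few lines (a minimal sketch of ours; one call to `run_cascade` corresponds to one bandit time step, and its return value is the reward \(r_{\bf a}(t)\) used later):

```python
import random

def run_cascade(seed_set, neighbors, p, rng):
    """One independent-cascade diffusion: each newly influenced node u gets a
    single chance to influence each out-neighbor v with probability p[(u, v)]."""
    influenced = set(seed_set)
    frontier = list(seed_set)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in neighbors.get(u, []):
                if v not in influenced and rng.random() < p[(u, v)]:
                    influenced.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(influenced)          # cascade size = number of influenced nodes

def estimate_sigma(seed_set, neighbors, p, rng, runs=1000):
    """Monte-Carlo estimate of sigma(S), as the offline greedy baseline uses."""
    return sum(run_cascade(seed_set, neighbors, p, rng) for _ in range(runs)) / runs
```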

In the traditional problem, \(G\) is fully known: not only the edge set \(E\) but also the diffusion parameters \(P\). Here, we consider an adaptive version of the problem, where only the node set \(V\) is known. At each time \(t\), we select a seed set \(S_t\) and observe a single resulting cascade \(\lbrace y^{(v)}_t\rbrace _{v\in V}\). Using the information from those outcomes, at time \(t+1\) we select a (possibly new) seed set, and so on. We want to maximize the average expected cascade size, (39) \[\begin{eqnarray} {\arg \max}_{ \lbrace S_t \subseteq V: |S_t|=K\rbrace _{t=1}^T} \ \frac{1}{T} &&\sum _{t=1}^T \ \mathbb {E}\left[\sum _{v \in V} y_t^{(v)} \bigg | \text{individuals in } S_t \text{ were initially influenced} \right] \end{eqnarray}\]

Recent works on adaptive influence maximization assume semi-bandit feedback, such as which specific nodes were influenced or even along which edges influence propagated [39, 41]. In [30], \(\epsilon\)-greedy strategies are proposed for adaptive influence maximization although assumptions are made on edge parameter distributions.

6.2 Algorithms

We propose to solve the adaptive influence maximization problem using CMAB-SM. Note that CMAB-SM is a general procedure for CMAB problems and does not use any special properties of network diffusion models or knowledge about the edge set \(E\) or parameters \(P\).

The reward \(r_{\bf a}(t)\) for an action \({\bf a}\) of choosing seed set \(S\) at time \(t\) is the count of infected individuals in a cascade run at time \(t\) where only nodes in \(S\) were initially infected, \[\begin{equation*} r_{\bf a}(t) = \sum _{v \in V} y_t^{(v)} \ \text{ where only } v \in S \text{ were initially influenced}. \end{equation*}\]

The aggregate reward function \(r_{\bf a}(t)\) cannot be decomposed into a function of “rewards” for individual nodes like \(f({\bf d}_{{\bf a}_t})\) in (1). Thus, it does not strictly satisfy the assumptions used to prove performance guarantees for CMAB-SM. However, the reward function \(r_{\bf a}(t)\) satisfies properties related to those in the assumptions. We discuss this in more detail in Appendix I.

We compare the performance of CMAB-SM against an \(\epsilon\)-greedy version of the credit distribution model [20], which we call \(\epsilon\)-CD. The credit distribution model was proposed to infer influence from historical data without knowledge of the diffusion parameters. It does require the edge set \(E\); to provide a fair comparison with CMAB-SM, we provide the credit distribution method with the edge set of the complete directed graph (in which edges \(e^{\prime } \not\in E\) have \(p_{e^{\prime }} = 0\)).

The \(\epsilon\)-CD algorithm follows the standard \(\epsilon\)-greedy method for bandits. For a fraction \(\epsilon\) of the total time horizon \(T\), a random size-\(K\) subset of nodes is chosen. For the rest of the time horizon, it exploits the best seed set of size \(K\) chosen by the credit-distribution algorithm using the diffusion history obtained during the \(\epsilon T\) exploration steps. The credit-distribution algorithm uses the list of infected nodes from the \(t-1\) previous cascades, and when exactly they were infected within those cascades, to estimate the expected reward of different size-\(K\) seed sets. The seed set it estimates as best is then used at time \(t\). Note that the proposed algorithm CMAB-SM uses more limited feedback: only the number of nodes infected in each cascade.

It is also of interest to understand how close CMAB-SM gets to the actual optimal set. However, for large networks, it can be computationally intractable to find the optimal set. Hence, instead of the optimal set, we compare CMAB-SM with the greedy-best set obtained by the simulation-based greedy algorithm, which is guaranteed [23] to have an influence of at least \(1-\frac{1}{e}\) times that of the actual optimal set. The simulation-based greedy algorithm is, however, computationally very expensive compared to CMAB-SM, as it involves a large number of costly Monte-Carlo simulations.

Note that, contrary to the cross-selling application discussed in Section 5, we do not use the LinUCB algorithm for our comparisons as it is not scalable for large values of \(N\).

6.3 Experimental Setup

For our experiments, we considered the Facebook Friends network [31], which has 4,039 nodes and 88,234 undirected edges. We used a representative community (sub-network) of the Facebook network with 534 nodes and 8,158 undirected edges, identified using the Louvain method for community detection [9]. The selected community originally had one node with an extremely high degree of 533 within the community, which was removed. Each undirected edge was then replaced by two directed edges. Following [23], for each node \(v\in V\), the parameter of each edge entering \(v\) was set to \(1/\text{indegree}(v)\). We used a time horizon of \(T=100,\!000\). At each time step \(t\), each method selected a seed set \(S_t\), which was used to run a single cascade. The total number of influenced users was the reward.

For the \(\epsilon\)-CD algorithm, we used the exploration probabilities \(\epsilon =0.10,0.25\). For the offline greedy algorithm, we averaged 1,000 Monte-Carlo diffusions for each candidate subset.

We used seed-set sizes \(K = 2,4,8\). Performance of each algorithm was measured using observed reward, averaged over 10 runs. For aesthetic reasons, we smoothed the average across runs using a simple moving average of the data for the previous 500 time points.
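The smoothing used for the plots is a plain simple moving average (a sketch of ours; the paper's window is the previous 500 time points):

```python
def moving_average(xs, window=500):
    """Simple moving average over the previous `window` observations."""
    out, acc = [], 0.0
    for i, x in enumerate(xs):
        acc += x
        if i >= window:
            acc -= xs[i - window]           # drop the value that left the window
        out.append(acc / min(i + 1, window))
    return out
```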


6.4 Results

The observed reward for each algorithm, averaged over 10 independent runs and further smoothed using moving average filter on the time-axis, is shown in Figure 2, separately for each \(K=2,4,8\).

Fig. 2.

Fig. 2. Averaged observed reward for different algorithms for different seed set sizes for the Facebook friends network.

For all values of \(K\), CMAB-SM significantly out-performs the UCB algorithm. For the size of this problem (\(N=534\) and \(K\in \lbrace 2,4,8\rbrace\)), the UCB algorithm is unable to even finish sampling each possible set of \(K\) nodes once, even for \(K=2\). Thus, although for sufficiently large \(T\) the UCB algorithm should eventually find the optimal set, it needs a much larger time horizon \(T\) to do so. The efficient divide-and-conquer strategy of CMAB-SM allows it to do well while the exhaustive-search based UCB algorithm is still in the exploration phase.

For all values of \(K\), CMAB-SM achieves a lower cumulative reward than the \(\epsilon\)-CD algorithms (area under the curves in Figure 2). Recall that the \(\epsilon\)-CD algorithms are domain-specific and receive significantly more feedback than our domain-agnostic algorithm CMAB-SM does, so the \(\epsilon\)-CD algorithms are expected to do better. We note that the CMAB-SM algorithm is in the SORT phase until around \(t=4\times 10^4\) and finishes exploration around \(t=9 \times 10^4\), which is consistent with Lemma 14 of the main text with \(\lambda\) defined in Equation (19).

However, despite CMAB-SM being domain-agnostic and receiving less feedback, the final seed sets selected by CMAB-SM are significantly better than those selected by the \(\epsilon\)-CD algorithms for \(K=2\) and \(K=8\), and slightly better for \(K=4\). In fact, for \(K=2\), despite having no explicit knowledge of the underlying network or diffusion dynamics, and receiving minimal feedback, CMAB-SM finds the same set as the greedy offline method, which is provably near-optimal. This result may in part be due to stochastic dominance holding with high probability, especially for \(K=2\).

We do not formally compare empirical run-times of the different algorithms because we wrote Python3 implementations for CMAB-SM and UCB, whereas for \(\epsilon\)-CD we used Python3 code that called an executable for the credit distribution method [19], which required extensive disk I/O. In Table 1, we provide complexity bounds suggesting that \(\epsilon\)-CD is inherently slower than CMAB-SM. In terms of both time and space complexity, the CMAB-SM algorithm outperforms the \(\epsilon\)-CD algorithm (a domain-specific algorithm) and the UCB algorithm (a bandit-specific algorithm).

Table 1. Computational Complexities of the Different Algorithms

| Algorithm | Amortized Time-Complexity | Worst-Case Space-Complexity |
| --- | --- | --- |
| CMAB-SM | \(O(K \log K)\) | \(O(N)\) |
| \(\epsilon\)-CD | \(\Omega(\epsilon N)\) | \(\Omega(\epsilon N T)\) |
| UCB | \(O(N^K K \log N)\) | \(O(N^K)\) |

We compare the amortized time-complexity of all the algorithms. We do so because, exactly once over the entire time horizon, the \(\epsilon\)-CD algorithm scans through the history of independent-cascade diffusions accumulated over \(\epsilon T\) time steps, where each diffusion can have length \(N\) in the worst case. Hence, the \(\epsilon N T\) time consumed by this bottleneck step is effectively distributed over \(T\) time steps. The space requirement, however, cannot be distributed across time steps: a machine must have at least the worst-case space available at once to store the history of independent-cascade diffusions accumulated over \(\epsilon T\) time steps. Hence, we compare the worst-case space-complexity of all algorithms.

Remark 3.

For a sufficiently long time horizon, there is an additional factor of \(T^{-\frac{1}{3}}\) in the amortized time-complexity of the CMAB-SM algorithm, since exploration and sorting of actions occur for only \(O(T^\frac{2}{3})\) steps.


7 CONCLUSIONS AND FUTURE WORK

This paper considers the problem of combinatorial multi-armed bandits with non-linear rewards, where an agent chooses \(K\) out of \(N\) arms in each time step and receives an aggregate reward. A novel algorithm, called CMAB-SM, is proposed, which is shown to be computationally efficient and has a space complexity that is linear in the number of base arms. The algorithm is analyzed in terms of a regret bound, and is shown to outperform the approach of treating each combinatorial action as a single arm for a limited time horizon \(T\). CMAB-SM provides a way to resolve two challenges in the combinatorial bandits problem. The first is that the feedback is non-linear in the individual arms, and the second is that the space complexity of previous approaches could be large due to the exploding action space.

CMAB-SM works efficiently for large \(N\) and \(K\). However, finding an algorithm that is efficient in space complexity, has a regret bound growing as \(T^{1/2}\) instead of \(T^{2/3}\), and has no combinatorial factors in the regret bound remains an important research direction. Following our work, this problem was considered in [3], where an accept-reject based algorithm is proposed that achieves a regret of \(\widetilde{O}(K \sqrt {KNT})\). We note that even though the dependence on \(T\) is better in [3], the dependence on \(K\) and \(N\) is not. Indeed, our bound is better when \(T\lt \widetilde{O}(NK^6)\). Due to this high requirement on \(T\), the approach of [3] does not outperform the approach in this paper for social influence maximization. Thus, an algorithm that is efficient in all parameters \(N\), \(K\), and \(T\) is left for the future. Considering non-symmetric functions of individual rewards and studying the applications of more general settings to social influence maximization also remain future work.

Footnotes

  1. \(\tilde{O}()\) is the big-\(O\) notation up to logarithmic factors.
  2. We note that the maximum function, in general, does not satisfy Assumption 5. However, if an arm exists whose rewards are always higher than those of the other arms, the agent can place any other arm among the remaining \(K-1\) arms without incurring any regret.
  3. We treat \(\frac{N}{K+1}\) as an integer. However, replacing \(\frac{N}{K+1}\) with \(\lceil \frac{N}{K+1}\rceil\) would not change the order analysis.

REFERENCES

  [1] Abbasi-Yadkori Yasin, Pal David, and Szepesvari Csaba. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24. 2312–2320.
  [2] Agarwal Mridul and Aggarwal Vaneet. 2018. Regret bounds for stochastic combinatorial multi-armed bandits with linear space complexity. arXiv preprint arXiv:1811.11925 (2018).
  [3] Agarwal Mridul, Aggarwal Vaneet, Quinn Christopher J., and Umrawal Abhishek. 2021. DART: aDaptive accept reject for non-linear top-k subset identification. In Proc. AAAI.
  [4] Agarwal Mridul, Aggarwal Vaneet, Quinn Christopher J., and Umrawal Abhishek K. 2021. Stochastic Top-\(K\) subset bandits with linear space and non-linear feedback. In Algorithmic Learning Theory. PMLR, 306–339.
  [5] Audibert Jean-Yves, Bubeck Sébastien, and Lugosi Gábor. 2014. Regret in online combinatorial optimization. Math. Oper. Res. 39, 1 (Feb. 2014), 31–45. https://doi.org/10.1287/moor.2013.0598
  [6] Auer Peter and Ortner Ronald. 2010. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61, 1–2 (2010), 55–65.
  [7] Bawa Vijay S. 1975. Optimal rules for ordering uncertain prospects. Journal of Financial Economics 2, 1 (1975), 95–121.
  [8] Besson Lilian and Kaufmann Emilie. 2018. What doubling tricks can and can't do for multi-armed bandits. arXiv.org (2018).
  [9] Blondel Vincent D., Guillaume Jean-Loup, Lambiotte Renaud, and Lefebvre Etienne. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, 10 (2008), P10008.
  [10] Cesa-Bianchi Nicolò and Lugosi Gábor. 2012. Combinatorial bandits. J. Comput. Syst. Sci. 78, 5 (Sept. 2012), 1404–1422. https://doi.org/10.1016/j.jcss.2012.01.001
  [11] Chen Wei, Wang Yajun, and Yuan Yang. 2013. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 28), Dasgupta Sanjoy and McAllester David (Eds.). PMLR, Atlanta, Georgia, USA, 151–159. http://proceedings.mlr.press/v28/chen13a.html.
  [12] Dani Varsha, Hayes Thomas, and Kakade Sham. 2008. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory. 355–366.
  [13] Dani Varsha, Hayes Thomas P., and Kakade Sham M. 2008. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, Servedio Rocco A. and Zhang Tong (Eds.). Omnipress, 355–366. http://colt2008.cs.helsinki.fi/papers/80-Dani.pdf.
  [14] Dani Varsha, Kakade Sham M., and Hayes Thomas P. 2008. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems 20, Platt J. C., Koller D., Singer Y., and Roweis S. T. (Eds.). Curran Associates, Inc., 345–352.
  [15] Domingos Pedro and Richardson Matt. 2001. Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 57–66.
  [16] Filippi Sarah, Cappe Olivier, Garivier Aurelien, and Szepesvari Csaba. 2010. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23. 586–594.
  [17] Gai Yi, Krishnamachari Bhaskar, and Jain Rahul. 2010. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In New Frontiers in Dynamic Spectrum, 2010 IEEE Symposium on. IEEE, 1–9.
  [18] Gopalan Aditya, Mannor Shie, and Mansour Yishay. 2014. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning. 100–108.
  [19] Goyal Amit. 2011. Credit Distribution Source Code Release. https://www.cs.ubc.ca/~goyal/code-release.php. Accessed: 2019-04-10.
  [20] Goyal Amit, Bonchi Francesco, and Lakshmanan Laks V. S. 2011. A data-based approach to social influence maximization. Proceedings of the VLDB Endowment 5, 1 (2011), 73–84.
  [21] Hadar Josef and Russell William R. 1969. Rules for ordering uncertain prospects. The American Economic Review 59, 1 (1969), 25–34.
  [22] Jun Kwang-Sung, Bhargava Aniruddha, Nowak Robert, and Willett Rebecca. 2017. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems 30. 98–108.
  [23] Kempe David, Kleinberg Jon, and Tardos Éva. 2003. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 137–146.
  [24] Krafft O. and Schmitz N. 1969. A note on Hoeffding's Inequality. J. Amer. Statist. Assoc. 64, 327 (1969), 907–912.
  [25] Kveton Branislav, Li Chang, Lattimore Tor, Markov Ilya, Rijke Maarten de, Szepesvari Csaba, and Zoghi Masrour. 2018. BubbleRank: Safe online learning to rerank. arXiv preprint arXiv:1806.05819 (2018).
  [26] Kveton Branislav, Szepesvari Csaba, Wen Zheng, and Ashkan Azin. 2015. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37), Bach Francis and Blei David (Eds.). PMLR, Lille, France, 767–776. http://proceedings.mlr.press/v37/kveton15.html.
  [27] Kveton Branislav, Wen Zheng, Ashkan Azin, Eydgahi Hoda, and Eriksson Brian. 2014. Matroid bandits: Fast combinatorial optimization with learning. arXiv preprint arXiv:1403.5045 (2014).
  [28] Kveton Branislav, Wen Zheng, Ashkan Azin, and Szepesvari Csaba. 2015. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics. 535–543.
  [29] Kveton Branislav, Zaheer Manzil, Szepesvari Csaba, Li Lihong, Ghavamzadeh Mohammad, and Boutilier Craig. 2020. Randomized exploration in generalized linear bandits. In 23rd International Conference on Artificial Intelligence and Statistics.
  [30] Lei Siyu, Maniu Silviu, Mo Luyi, Cheng Reynold, and Senellart Pierre. 2015. Online influence maximization. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 645–654.
  [31] Leskovec Jure and Mcauley Julian J. 2012. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems. 539–547.
  [32] Li Lihong, Chu Wei, Langford John, and Schapire Robert E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. 661–670.
  [33] Li Lihong, Lu Yu, and Zhou Dengyong. 2017. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning. 2071–2080.
  [34] Liau David, Song Zhao, Price Eric, and Yang Ger. 2018. Stochastic multi-armed bandits in constant space. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 84), Storkey Amos and Perez-Cruz Fernando (Eds.). PMLR, Playa Blanca, Lanzarote, Canary Islands, 386–394. http://proceedings.mlr.press/v84/liau18a.html.
  [35] Lin Tian, Abrahao Bruno, Kleinberg Robert, Lui John, and Chen Wei. 2014. Combinatorial partial monitoring game with linear feedback and its applications. In Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 32), Xing Eric P. and Jebara Tony (Eds.). PMLR, Bejing, China, 901–909. http://proceedings.mlr.press/v32/lind14.html.
  [36] Nuara Alessandro, Trovo Francesco, Gatti Nicola, Restelli Marcello, et al. 2018. A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns. In Thirty-Second AAAI Conference on Artificial Intelligence. 1840–1846.
  [37] Rejwan Idan and Mansour Yishay. 2020. Top-\(k\) combinatorial bandits with full-bandit feedback. In Algorithmic Learning Theory. 752–776.
  [38] Slomson Alan B. 1997. Introduction to Combinatorics. CRC Press.
  [39] Vaswani Sharan, Kveton Branislav, Wen Zheng, Ghavamzadeh Mohammad, Lakshmanan Laks V. S., and Schmidt Mark. 2017. Model-independent online learning for influence maximization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 3530–3539.
  [40] Wen Zheng, Kveton Branislav, Valko Michal, and Vaswani Sharan. 2016. Online influence maximization under independent cascade model with semi-bandit feedback. arXiv preprint arXiv:1605.06593 (2016).
  [41] Wen Zheng, Kveton Branislav, Valko Michal, and Vaswani Sharan. 2017. Online influence maximization under independent cascade model with semi-bandit feedback. In Advances in Neural Information Processing Systems. 3022–3032.
  [42] Wong Raymond Chi-Wing, Fu Ada Wai-Chee, and Wang K. 2003. MPIS: Maximal-profit item selection with cross-selling considerations. In Third IEEE International Conference on Data Mining. 371–378.
  [43] Xiang Yu, Lan Tian, Aggarwal Vaneet, and Chen Yih-Farn R. 2016. Joint latency and cost optimization for erasure-coded data center storage. IEEE/ACM Transactions on Networking (TON) 24, 4 (2016), 2443–2457.
  [44] Zhang Weinan, Zhang Ying, Gao Bin, Yu Yong, Yuan Xiaojie, and Liu Tie-Yan. 2012. Joint optimization of bid and budget allocation in sponsored search. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1177–1185.
  [45] Zhang Zhijie, Chen Wei, Sun Xiaoming, and Zhang Jialin. 2021. Online influence maximization with node-level feedback using standard offline oracles. arXiv preprint arXiv:2109.06077 (2021).
