Learning to Control Renewal Processes with Bandit Feedback

Abstract
We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent of each other and have distinct, unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. We also allow the controller to interrupt a task and initiate a new one. Beyond the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a second trade-off: should the controller let the current task run to completion and collect its reward, or abandon it for a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interrupt mechanism improves the reward linearly over time. Applications of this model include server scheduling, optimal free sampling strategies in advertising, and adaptive content selection. From a learning perspective, the interrupt mechanism necessitates learning the entire distribution of each arm from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI, based on the median-of-means estimator, for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ) + KL) regret. We also prove that the regret under any admissible policy is Ω(K log(τ)), which implies that UCB-BwI is order-optimal.
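The robustness claim above rests on the median-of-means estimator, which concentrates around the true mean even when only a finite variance exists and the empirical mean does not concentrate well. The sketch below shows the standard construction (split the samples into groups, average each group, return the median of the group means); it is illustrative only, assuming the textbook form of the estimator rather than the paper's exact grouping rule or its use inside UCB-BwI's confidence intervals. The names median_of_means and num_groups, and the Pareto demo, are hypothetical.

```python
import numpy as np

def median_of_means(samples, num_groups):
    """Median-of-means estimate of the mean of `samples`.

    Splits the samples into `num_groups` equally sized groups,
    averages each group, and returns the median of the group
    means. Unlike the plain empirical mean, this estimator
    enjoys sub-Gaussian-style deviation bounds for heavy-tailed
    distributions with finite variance.
    """
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and len(samples)")
    # Drop the remainder so every group has the same size.
    group_size = n // num_groups
    groups = samples[: group_size * num_groups].reshape(num_groups, group_size)
    return float(np.median(groups.mean(axis=1)))

# Demo on a heavy-tailed completion-time distribution:
# Pareto(shape=1.5, scale=1) has infinite variance and true mean 3.
rng = np.random.default_rng(0)
completion_times = rng.pareto(1.5, size=10_000) + 1.0
print("empirical mean:  ", completion_times.mean())
print("median of means: ", median_of_means(completion_times, num_groups=15))
```

The number of groups trades robustness against accuracy: more groups tolerate more extreme outliers but average fewer samples per group, so a typical choice ties it to the desired confidence level (roughly log(1/δ) groups for confidence 1 − δ).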