Abstract
We study multi-armed bandit (MAB) problems with additional observations, where in each round, the decision maker selects an arm to play and can also observe the rewards of additional arms (within a given budget) by paying certain costs. In the case of stochastic rewards, we develop a new algorithm, KL-UCB-AO, which is asymptotically optimal as the time horizon grows large; it achieves this by smartly identifying the optimal set of arms to explore with the given budget of additional observations. In the case of adversarial rewards, we propose H-INF, an algorithm with order-optimal regret. H-INF exploits a two-layered structure where, in each layer, we run a known optimal MAB algorithm. This hierarchical structure facilitates the regret analysis of the algorithm and, in turn, yields order-optimal regret. We apply the framework of MAB with additional observations to the design of rate adaptation schemes in 802.11-like wireless systems, and to that of online advertising systems. In both cases, we demonstrate that our algorithms leverage additional observations to significantly improve system performance. We believe the techniques developed in this paper are of independent interest for other MAB problems, e.g., contextual or graph-structured MAB.
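To make the setting concrete, the following is a minimal sketch of the per-round decision in this model: the learner plays the arm with the highest KL-UCB index and spends its observation budget on further arms. Note this is an illustrative simplification, not the KL-UCB-AO algorithm from the paper; in particular, picking the runner-up arms by index is a hypothetical heuristic standing in for the paper's optimal exploration set, and all function names are our own.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """KL-UCB index: the largest q >= mean such that
    pulls * KL(mean, q) <= log(t), found by bisection."""
    if pulls == 0:
        return 1.0  # unobserved arms get the most optimistic index
    target = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

def select(means, pulls, t, budget):
    """Return (arm to play, arms to additionally observe).
    Play the arm with the largest index; spend the budget on the
    runner-up arms (an illustrative heuristic, not KL-UCB-AO's
    optimal exploration set)."""
    idx = [kl_ucb_index(m, n, t) for m, n in zip(means, pulls)]
    order = sorted(range(len(means)), key=lambda a: idx[a], reverse=True)
    return order[0], order[1:1 + budget]
```

With empirical means `[0.9, 0.1, 0.5]` and equal pull counts, `select` plays arm 0 and, with a budget of one, observes arm 2 for free, illustrating how the budget buys extra feedback beyond the played arm.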
Index Terms
Multi-armed Bandit with Additional Observations