
Multi-armed Bandit with Additional Observations

Published: 03 April 2018

Abstract

We study multi-armed bandit (MAB) problems with additional observations, where in each round, the decision maker selects an arm to play and can also observe rewards of additional arms (within a given budget) by paying certain costs. In the case of stochastic rewards, we develop a new algorithm KL-UCB-AO which is asymptotically optimal when the time horizon grows large, by smartly identifying the optimal set of the arms to be explored using the given budget of additional observations. In the case of adversarial rewards, we propose H-INF, an algorithm with order-optimal regret. H-INF exploits a two-layered structure where in each layer, we run a known optimal MAB algorithm. Such a hierarchical structure facilitates the regret analysis of the algorithm, and in turn, yields order-optimal regret. We apply the framework of MAB with additional observations to the design of rate adaptation schemes in 802.11-like wireless systems, and to that of online advertisement systems. In both cases, we demonstrate that our algorithms leverage additional observations to significantly improve the system performance. We believe the techniques developed in this paper are of independent interest for other MAB problems, e.g., contextual or graph-structured MAB.
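The setting described above can be illustrated with a small simulation. The sketch below is not the paper's KL-UCB-AO algorithm; it is a simplified stand-in in which each round the learner plays the arm with the highest KL-UCB index and, as a greedy heuristic, spends the observation budget on the next-highest-index arms. The arm means, horizon, and budget values are illustrative assumptions.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """KL-UCB index: largest q with pulls * KL(mean, q) <= log(t),
    found by bisection on [mean, 1]."""
    if pulls == 0:
        return 1.0  # unexplored arms get the maximal index
    target = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

def run(means, horizon, budget, rng):
    """Play the highest-index arm each round; additionally observe the
    next `budget` arms by index (a greedy heuristic, not KL-UCB-AO).
    Returns the cumulative expected regret."""
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    regret = 0.0
    best = max(means)
    for t in range(1, horizon + 1):
        idx = [kl_ucb_index(sums[i] / counts[i] if counts[i] else 0.0,
                            counts[i], t) for i in range(k)]
        order = sorted(range(k), key=lambda i: -idx[i])
        play = order[0]
        observed = [play] + order[1:1 + budget]  # budgeted extra observations
        for i in observed:
            r = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += r
        regret += best - means[play]  # only the played arm incurs regret
    return regret
```

Running `run` with a positive budget lets suboptimal arms be tracked without being played, which is the mechanism by which additional observations can reduce regret; the paper's KL-UCB-AO instead identifies the asymptotically optimal set of arms on which to spend the budget.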

