Abstract
We consider stochastic bandit problems with a continuous set of arms, where the expected reward is a continuous and unimodal function of the arm. For these problems, we propose the Stochastic Polychotomy (SP) algorithm, and derive finite-time upper bounds on its regret and optimization error. We show that, for a class of reward functions, the SP algorithm achieves a regret and an optimization error with optimal scalings, i.e., $O(\sqrt{T})$ and $O(1/\sqrt{T})$ (up to a logarithmic factor), respectively. SP constitutes the first order-optimal algorithm for non-smooth expected reward functions, as well as for smooth functions with unknown smoothness. The algorithm is based on sequential statistical tests used to successively trim an interval that contains the best arm with high probability. These tests exhibit a minimal sample complexity, which confers on SP its adaptivity and optimality. Numerical experiments reveal that the algorithm even outperforms state-of-the-art algorithms that exploit knowledge of the smoothness of the reward function. The performance of SP is further illustrated on the problem of setting optimal reserve prices in repeated second-price auctions, where the algorithm is evaluated on real-world data.
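The abstract describes the overall scheme (sequential statistical tests that successively trim an interval containing the best arm) without reproducing SP itself. The sketch below is a simplified illustration of that idea, not the authors' algorithm: it uses stochastic trisection with a Hoeffding-style confidence test, and the function `trim_interval`, the noise level `sigma`, and the confidence parameter `delta` are all assumptions made for this example.

```python
import math
import random

def trim_interval(f, T, sigma=0.1, delta=0.05, seed=0):
    """Illustrative sketch (not the SP algorithm): maximize a unimodal
    f on [0, 1] from noisy samples by repeatedly comparing the two
    trisection points and trimming the third of the interval that,
    with high probability, does not contain the maximizer."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0
    t = 0
    while t < T and hi - lo > 1e-3:
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        sa = sb = 0.0
        n = 0
        while t < T:
            # Noisy reward observations at the two interior arms.
            sa += f(a) + rng.gauss(0, sigma)
            sb += f(b) + rng.gauss(0, sigma)
            n += 1
            t += 2
            # Hoeffding-style confidence radius for the empirical means.
            r = sigma * math.sqrt(2 * math.log(2 * n / delta) / n)
            if abs(sa - sb) / n > 2 * r:
                break  # the comparison is statistically significant
        # Unimodality: if f(a) >= f(b), the maximizer lies in [lo, b],
        # and symmetrically otherwise. (If the budget T ran out before
        # the test fired, this trim is a best guess, not a guarantee.)
        if sa >= sb:
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0
```

The test stops sampling a pair of arms as soon as their comparison is resolved, which is the spirit (if not the letter) of the minimal-sample-complexity tests credited with SP's adaptivity: no smoothness parameter appears anywhere in the loop.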
Unimodal Bandits with Continuous Arms: Order-optimal Regret without Smoothness
SIGMETRICS '20: Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems