Unimodal Bandits with Continuous Arms: Order-optimal Regret without Smoothness

Published: 27 May 2020

Abstract

We consider stochastic bandit problems with a continuous set of arms, where the expected reward is a continuous and unimodal function of the arm. For these problems, we propose the Stochastic Polychotomy (SP) algorithm, and derive finite-time upper bounds on its regret and optimization error. We show that, for a class of reward functions, the SP algorithm achieves a regret and an optimization error with optimal scalings, i.e., $O(\sqrt{T})$ and $O(1/\sqrt{T})$ (up to a logarithmic factor), respectively. SP constitutes the first order-optimal algorithm for non-smooth expected reward functions, as well as for smooth functions with unknown smoothness. The algorithm is based on sequential statistical tests used to successively trim an interval that contains the best arm with high probability. These tests exhibit a minimal sample complexity, which confers on SP its adaptivity and optimality. Numerical experiments reveal that the algorithm even outperforms state-of-the-art algorithms that exploit knowledge of the smoothness of the reward function. The performance of SP is further illustrated on the problem of setting optimal reserve prices in repeated second-price auctions, where the algorithm is evaluated on real-world data.
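The interval-trimming idea described above can be sketched in code. The following is a minimal illustrative trisection scheme for a unimodal reward on $[0,1]$, using a Hoeffding-style anytime confidence radius as the sequential test; it is not the SP algorithm from the paper, and the function names, the per-round sample cap, and the confidence schedule are assumptions made for this example.

```python
import math
import random

def trim_interval(sample_reward, lo=0.0, hi=1.0, rounds=20, delta=0.05):
    """Illustrative interval trimming for a unimodal reward on [lo, hi].

    Each round samples two interior arms until a sequential confidence
    test separates their empirical means (or a sample cap is hit), then
    discards the third of the interval behind the losing arm. By
    unimodality, the maximizer cannot lie beyond the worse arm.
    """
    for _ in range(rounds):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        sum_a = sum_b = 0.0
        n = 0
        while True:
            n += 1
            sum_a += sample_reward(a)
            sum_b += sample_reward(b)
            # Hoeffding-style anytime radius for rewards bounded in [0, 1].
            rad = math.sqrt(math.log(4.0 * n * n / delta) / (2.0 * n))
            gap = sum_a / n - sum_b / n
            if abs(gap) > 2.0 * rad or n >= 2000:
                break
        if gap >= 0:
            hi = b   # arm b loses: the peak lies left of b
        else:
            lo = a   # arm a loses: the peak lies right of a
    return (lo + hi) / 2.0

# Example: noisy unimodal reward peaked at x = 0.3, clipped to [0, 1].
random.seed(0)
reward = lambda x: max(0.0, min(1.0, 1 - (x - 0.3) ** 2
                                + random.uniform(-0.05, 0.05)))
best = trim_interval(reward)
```

The sequential test is what drives adaptivity: far from the peak the means separate quickly and the test stops early, while close to the peak more samples are drawn before a third of the interval is discarded.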

