
Adaptive Discretization for Episodic Reinforcement Learning in Metric Spaces

Published: 17 December 2019

Abstract

We present an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces. Our algorithm is based on a novel Q-learning policy with adaptive data-driven discretization. The central idea is to maintain a finer partition of the state-action space in regions that are frequently visited in historical trajectories and have higher payoff estimates. We demonstrate how our adaptive partitions take advantage of the shape of the optimal Q-function and of the joint space, without sacrificing worst-case performance. In particular, we recover the regret guarantees of prior algorithms for continuous state-action spaces, which additionally require either an optimal discretization as input or access to a simulation oracle. Moreover, experiments demonstrate how our algorithm automatically adapts to the underlying structure of the problem, resulting in much better performance compared both to heuristics and to Q-learning with uniform discretization.
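
To make the central idea concrete, the following is a minimal Python sketch of adaptive discretization, not the paper's full algorithm: it covers a one-dimensional action space on [0, 1], keeps a running payoff estimate per region with an optimistic exploration bonus, and splits a region in half once its visit count exceeds the inverse square of its radius, so the partition becomes finer exactly where play concentrates. The class names, bonus scale, and split threshold are illustrative assumptions; the full method additionally handles states, episodes, and Bellman updates.

import math

class Region:
    """A subinterval of the action space with its own payoff estimate."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.q = 1.0  # optimistic initial estimate
        self.n = 0    # visit count

    def radius(self):
        return (self.hi - self.lo) / 2.0

class AdaptivePartition:
    """A partition of [0, 1] that is refined where visits accumulate."""
    def __init__(self):
        self.regions = [Region(0.0, 1.0)]

    def select(self):
        # Optimistic selection: the estimate plus a bonus that
        # shrinks as a region gathers more samples.
        return max(self.regions, key=lambda r: r.q + 1.0 / math.sqrt(r.n + 1))

    def update(self, region, reward):
        region.n += 1
        region.q += (reward - region.q) / region.n  # running average
        # Split rule (illustrative): once visits exceed 1 / radius^2,
        # replace the region with two halves that inherit its estimate,
        # giving a finer partition exactly where payoffs draw repeated play.
        if region.n >= (1.0 / region.radius()) ** 2:
            mid = (region.lo + region.hi) / 2.0
            left, right = Region(region.lo, mid), Region(mid, region.hi)
            left.q = right.q = region.q
            self.regions.remove(region)
            self.regions += [left, right]

# Hypothetical usage: a payoff peaked at a = 0.7, so regions near 0.7
# end up subdivided far more finely than the rest of the interval.
partition = AdaptivePartition()
for _ in range(2000):
    region = partition.select()
    action = (region.lo + region.hi) / 2.0
    payoff = max(0.0, 1.0 - abs(action - 0.7))
    partition.update(region, payoff)
print(sorted((r.lo, r.hi) for r in partition.regions))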

