Adaptive Discretization for Episodic Reinforcement Learning in Metric Spaces

Abstract
We present an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces. Our algorithm is based on a novel Q-learning policy with adaptive data-driven discretization. The central idea is to maintain a finer partition of the state-action space in regions that are frequently visited in historical trajectories and have higher payoff estimates. We demonstrate how our adaptive partitions take advantage of the shape of the optimal Q-function and the joint space without sacrificing worst-case performance. In particular, we recover the regret guarantees of prior algorithms for continuous state-action spaces, which additionally require either an optimal discretization as input or access to a simulation oracle. Moreover, experiments demonstrate how our algorithm automatically adapts to the underlying structure of the problem, resulting in much better performance compared to both heuristics and Q-learning with uniform discretization.
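To make the central idea concrete, below is a minimal sketch of episodic Q-learning with adaptive discretization, assuming a one-dimensional state space and a one-dimensional action space, both normalized to [0, 1], and rewards in [0, 1]. The class names, the splitting threshold, and the bonus constant are illustrative choices for exposition, not the paper's exact algorithm: a cell of the joint space is split once its visit count exceeds the inverse square of its diameter, so frequently visited, high-payoff regions end up with a finer partition while rarely visited regions stay coarse.

```python
# Illustrative sketch only: a hypothetical adaptive-discretization Q-learner,
# not the paper's exact algorithm or constants.
import math
import random


class Region:
    """A hypercube cell of the joint state-action space [0,1]^2."""

    def __init__(self, s_lo, s_hi, a_lo, a_hi, q_init):
        self.s_lo, self.s_hi = s_lo, s_hi
        self.a_lo, self.a_hi = a_lo, a_hi
        self.visits = 0
        self.q = q_init  # optimistic Q-value estimate for the whole cell

    def diameter(self):
        return max(self.s_hi - self.s_lo, self.a_hi - self.a_lo)

    def contains_state(self, s):
        return self.s_lo <= s <= self.s_hi


class AdaptiveQLearner:
    def __init__(self, horizon, bonus_scale=2.0):
        self.H = horizon
        self.c = bonus_scale  # assumed exploration-bonus constant
        # One partition per step h; start with a single cell covering everything,
        # initialized optimistically at H (an upper bound on the value).
        self.partitions = [[Region(0.0, 1.0, 0.0, 1.0, q_init=horizon)]
                           for _ in range(horizon)]

    def select(self, h, s):
        """Greedy rule: among cells containing s, pick the highest Q estimate."""
        feasible = [r for r in self.partitions[h] if r.contains_state(s)]
        region = max(feasible, key=lambda r: r.q)
        action = random.uniform(region.a_lo, region.a_hi)  # any action in the cell
        return region, action

    def update(self, h, region, reward, next_value):
        region.visits += 1
        t = region.visits
        alpha = (self.H + 1) / (self.H + t)  # standard optimistic learning rate
        bonus = self.c / math.sqrt(t)        # exploration bonus shrinks with visits
        target = reward + next_value + bonus
        region.q = (1 - alpha) * region.q + alpha * target
        # Split a cell once it has been visited ~1/diameter^2 times, so the
        # partition refines exactly where the data concentrates.
        if t >= 1.0 / region.diameter() ** 2 and region.diameter() > 1e-3:
            self.partitions[h].remove(region)
            s_mid = (region.s_lo + region.s_hi) / 2
            a_mid = (region.a_lo + region.a_hi) / 2
            for s_lo, s_hi in ((region.s_lo, s_mid), (s_mid, region.s_hi)):
                for a_lo, a_hi in ((region.a_lo, a_mid), (a_mid, region.a_hi)):
                    # Children inherit the parent's (optimistic) estimate.
                    self.partitions[h].append(
                        Region(s_lo, s_hi, a_lo, a_hi, q_init=region.q))
```

In use, the caller would compute next_value at step h as the maximum Q estimate over cells containing the observed next state at step h+1 (zero at the final step), mirroring the optimistic backup of tabular Q-learning with UCB-style bonuses while letting the discretization itself adapt to the data.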