ABSTRACT
For a Markov Decision Process with finite state space (size S) and action space (size A per state), we propose a new algorithm---Delayed Q-Learning. We prove it is PAC, achieving near-optimal performance on all but Õ(SA) timesteps using O(SA) space, improving on the Õ(S²A) bounds of the best previous algorithms. This result proves that efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience---no resets or parallel sampling are used. Beyond its smaller storage and experience requirements, Delayed Q-Learning's per-experience computation cost is much less than that of previous PAC algorithms.
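For concreteness, the sketch below illustrates the flavor of Delayed Q-Learning: Q-values start optimistically at 1/(1-γ), each state-action pair accumulates m sampled targets before an "attempted update," and a LEARN flag suppresses further sampling of pairs whose updates keep failing. This is a minimal illustrative rendering, not the authors' reference implementation; the environment interface (reset/step), and the constants m and eps1, are assumptions for the example rather than the PAC-theoretic settings derived in the paper.

```python
import numpy as np

def delayed_q_learning(env, S, A, gamma=0.95, m=50, eps1=0.01, steps=100_000):
    """Illustrative sketch of Delayed Q-Learning.

    Assumes rewards in [0, 1] and an env exposing reset() -> state and
    step(a) -> (next_state, reward, done). m and eps1 are placeholder
    constants, not the values the PAC analysis prescribes.
    """
    v_max = 1.0 / (1.0 - gamma)
    Q = np.full((S, A), v_max)               # optimistic initial Q-values
    U = np.zeros((S, A))                     # accumulated update targets
    l = np.zeros((S, A), dtype=int)          # samples since last attempted update
    t_attempt = np.zeros((S, A), dtype=int)  # time of last attempted update
    learn = np.ones((S, A), dtype=bool)      # LEARN flags
    t_star = 0                               # time of most recent successful update

    s = env.reset()
    for t in range(1, steps + 1):
        a = int(np.argmax(Q[s]))             # act greedily w.r.t. optimistic Q
        s_next, r, done = env.step(a)

        if learn[s, a]:
            U[s, a] += r + gamma * np.max(Q[s_next])
            l[s, a] += 1
            if l[s, a] == m:                 # attempted update
                target = U[s, a] / m + eps1
                if Q[s, a] - target >= eps1:   # i.e., Q - U/m >= 2*eps1: succeed
                    Q[s, a] = target
                    t_star = t
                elif t_attempt[s, a] >= t_star:
                    learn[s, a] = False      # no Q changed since last attempt; stop sampling
                t_attempt[s, a] = t
                U[s, a] = 0.0
                l[s, a] = 0
        elif t_attempt[s, a] < t_star:
            learn[s, a] = True               # some Q-value changed; resume sampling

        s = env.reset() if done else s_next
    return Q
```

Each state-action pair performs only O(1) work per experience plus an O(A) max over next-state actions, which is the source of the low per-experience computation cost the abstract highlights, and the O(SA) counters and flags account for the O(SA) space bound.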
Index Terms
PAC model-free reinforcement learning