DOI: 10.1145/1143844.1143955

PAC model-free reinforcement learning

ABSTRACT

For a Markov Decision Process with finite state (size S) and action spaces (size A per state), we propose a new algorithm---Delayed Q-Learning. We prove it is PAC, achieving near-optimal performance on all but Õ(SA) timesteps using O(SA) space, improving on the Õ(S²A) bounds of the best previous algorithms. This result proves that efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience---no resets or parallel sampling are used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
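
To make the "delayed" update idea concrete, here is a minimal sketch of one plausible reading of the algorithm: Q-values start optimistically at 1/(1-γ), each state--action pair accumulates m one-step targets, and an update is applied only if it would lower the value by at least ε₁. This assumes rewards in [0, 1] and a discount factor γ < 1; the class name, parameter values, and variable names (m, eps1, the LEARN flags) are illustrative assumptions, not the paper's pseudocode.

```python
import numpy as np

class DelayedQLearner:
    """Sketch of a delayed Q-learning agent (illustrative, not the paper's exact pseudocode)."""

    def __init__(self, n_states, n_actions, gamma=0.95, m=20, eps1=0.01):
        self.gamma, self.m, self.eps1 = gamma, m, eps1
        # Optimistic initialization: every Q-value starts at the maximum
        # possible discounted return 1 / (1 - gamma), which drives exploration.
        self.Q = np.full((n_states, n_actions), 1.0 / (1.0 - gamma))
        self.U = np.zeros((n_states, n_actions))            # accumulated targets
        self.l = np.zeros((n_states, n_actions), int)       # samples gathered so far
        self.learn = np.ones((n_states, n_actions), bool)   # LEARN flags
        self.t_attempt = np.zeros((n_states, n_actions), int)  # time of last attempted update
        self.t_last_change = 0  # time of the most recent successful update (any pair)
        self.t = 0

    def act(self, s):
        # Greedy in the current (optimistic) Q-values.
        return int(np.argmax(self.Q[s]))

    def observe(self, s, a, r, s_next):
        self.t += 1
        if self.learn[s, a]:
            # Accumulate one sample of the one-step target r + gamma * V(s').
            self.U[s, a] += r + self.gamma * self.Q[s_next].max()
            self.l[s, a] += 1
            if self.l[s, a] == self.m:
                target = self.U[s, a] / self.m
                if self.Q[s, a] - target >= 2 * self.eps1:
                    # Attempted update succeeds: lower Q(s,a) by at least eps1.
                    self.Q[s, a] = target + self.eps1
                    self.t_last_change = self.t
                elif self.t_attempt[s, a] >= self.t_last_change:
                    # Nothing has changed since the last attempt, so stop
                    # collecting samples for this pair for now.
                    self.learn[s, a] = False
                self.t_attempt[s, a] = self.t
                self.U[s, a] = 0.0
                self.l[s, a] = 0
        elif self.t_attempt[s, a] < self.t_last_change:
            # Some Q-value changed since this pair's last attempt: re-enable learning.
            self.learn[s, a] = True
```

Under this reading, the per-step work is a constant number of arithmetic operations and one max over A values, and the only storage is a handful of O(SA) tables, which is consistent with the model-free, O(SA)-space claim above.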


