Article

Off-policy actor-critic

Published: 26 June 2012

Abstract

This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
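
The abstract describes the shape of the algorithm: a gradient temporal-difference critic evaluates the target policy from data generated by a different behavior policy, and the actor takes an importance-weighted policy-gradient step using the critic's TD error. Below is a minimal sketch of one such per-time-step update, assuming linear function approximation and a TDC/GTD-style critic with λ = 0; the paper's full algorithm additionally maintains eligibility traces, and all names and step sizes here are illustrative rather than the paper's notation.

```python
import numpy as np

def off_policy_actor_critic_step(x, x_next, reward, gamma,
                                 pi_prob, b_prob, grad_log_pi,
                                 v, w, u,
                                 alpha_v=0.05, alpha_w=0.01, alpha_u=0.01):
    """One illustrative off-policy actor-critic update (lambda = 0 sketch).

    x, x_next   : feature vectors for the current and next state
    pi_prob     : probability of the taken action under the target policy pi
    b_prob      : probability of the taken action under the behavior policy b
    grad_log_pi : gradient of log pi(a|s) with respect to the actor weights u
    v, w        : critic weights and the gradient-TD correction weights
    u           : actor (policy) weights
    """
    rho = pi_prob / b_prob                           # importance-sampling ratio
    delta = reward + gamma * (v @ x_next) - (v @ x)  # TD error of the linear critic

    # Critic: TDC/GTD-style gradient-TD update, weighted by rho for off-policy data
    v = v + alpha_v * rho * (delta * x - gamma * (w @ x) * x_next)
    w = w + alpha_w * rho * (delta - (w @ x)) * x

    # Actor: policy-gradient step in the direction rho * delta * grad log pi(a|s)
    u = u + alpha_u * rho * delta * grad_log_pi

    return v, w, u
```

Each update touches only vectors the size of the learned weight sets, which is where the linear per-time-step complexity claimed in the abstract comes from; adding eligibility traces, as the paper does, preserves that property.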


    Published In

    ICML'12: Proceedings of the 29th International Conference on Machine Learning
    June 2012
    1912 pages
    ISBN:9781450312851

    Sponsors

    • PASCAL2 - Pattern Analysis, Statistical Modelling and Computational Learning
    • IBM Research
    • NSF
    • Microsoft Research
    • Facebook

    Publisher

    Omnipress

    Madison, WI, United States

    Cited By

    • (2023) Model-based reparameterization policy gradient methods. Proceedings of the 37th International Conference on Neural Information Processing Systems, 10.5555/3666122.3669112, pp. 68391-68419. Online publication date: 10-Dec-2023
    • (2023) Adaptive barrier smoothing for first-order policy gradient with contact dynamics. Proceedings of the 40th International Conference on Machine Learning, 10.5555/3618408.3620136, pp. 41219-41243. Online publication date: 23-Jul-2023
    • (2023) Model-based reinforcement learning with scalable composite policy gradient estimators. Proceedings of the 40th International Conference on Machine Learning, 10.5555/3618408.3619545, pp. 27346-27377. Online publication date: 23-Jul-2023
    • (2022) Teacher forcing recovers reward functions for text generation. Proceedings of the 36th International Conference on Neural Information Processing Systems, 10.5555/3600270.3601185, pp. 12594-12607. Online publication date: 28-Nov-2022
    • (2022) GCS: Graph-Based Coordination Strategy for Multi-Agent Reinforcement Learning. Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, 10.5555/3535850.3535976, pp. 1128-1136. Online publication date: 9-May-2022
    • (2021) COMBO. Proceedings of the 35th International Conference on Neural Information Processing Systems, 10.5555/3540261.3542479, pp. 28954-28967. Online publication date: 6-Dec-2021
    • (2021) Conservative offline distributional reinforcement learning. Proceedings of the 35th International Conference on Neural Information Processing Systems, 10.5555/3540261.3541732, pp. 19235-19247. Online publication date: 6-Dec-2021
    • (2021) Conservative data sharing for multi-task offline reinforcement learning. Proceedings of the 35th International Conference on Neural Information Processing Systems, 10.5555/3540261.3541140, pp. 11501-11516. Online publication date: 6-Dec-2021
    • (2021) Guiding Evolutionary Strategies with Off-Policy Actor-Critic. Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 10.5555/3463952.3464104, pp. 1317-1325. Online publication date: 3-May-2021
    • (2021) Diet Planning with Machine Learning: Teacher-forced REINFORCE for Composition Compliance with Nutrition Enhancement. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 10.1145/3447548.3467201, pp. 3150-3160. Online publication date: 14-Aug-2021
