DOI: 10.5555/3545946.3598767
research-article

Curriculum Offline Reinforcement Learning

Published: 30 May 2023

Abstract

Offline reinforcement learning holds the promise of obtaining powerful agents from large datasets. To achieve this, a good algorithm should always benefit from (or at least not be degraded by) adding more samples, even if those samples were not collected by expert policies. However, we observe that many popular offline RL algorithms do not possess such a property and sometimes suffer from adding heterogeneous or poor samples to the dataset. Empirically, we show that, at a given stage of the learning process, not all samples are useful for these algorithms. Specifically, the agent can learn more efficiently using only the samples collected by a policy similar to the current policy. This indicates that different samples may contribute to different stages of the training process, and we therefore propose Curriculum Offline Reinforcement Learning (CUORL) to equip previous methods with such a favorable property. In CUORL, we select the samples that are likely to be generated by the current policy to train the agent. Empirically, we show that CUORL prevents the negative impact of adding samples from poor policies and consistently improves performance with more samples (even from random policies). Moreover, CUORL achieves state-of-the-art performance on standard D4RL datasets, which indicates the potential of curriculum learning for offline RL.
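
The curriculum mechanism described in the abstract, training on the transitions that the current policy is most likely to have generated, can be illustrated with a minimal sketch. This is not the paper's implementation: the diagonal-Gaussian likelihood scoring, the 30% keep ratio, and the helper names (gaussian_log_prob, select_curriculum_batch) are illustrative assumptions, and the selected subset would then be handed to any standard offline RL learner.

```python
# Hypothetical sketch of the curriculum idea in the abstract: at each training
# stage, keep only the transitions whose actions are likely under the current
# policy, then pass that subset to an off-the-shelf offline RL algorithm.
# All names and the keep ratio below are illustrative, not taken from the paper.
import numpy as np

def gaussian_log_prob(mean, log_std, action):
    """Log-density of `action` under a diagonal Gaussian policy."""
    var = np.exp(2.0 * log_std)
    return -0.5 * np.sum(
        (action - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi),
        axis=-1,
    )

def select_curriculum_batch(dataset, policy_mean_fn, log_std, keep_ratio=0.3):
    """Return the subset of transitions most likely under the current policy."""
    means = policy_mean_fn(dataset["observations"])                  # (N, act_dim)
    scores = gaussian_log_prob(means, log_std, dataset["actions"])   # (N,)
    k = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(scores)[-k:]                                    # highest-likelihood transitions
    return {key: value[keep] for key, value in dataset.items()}

# Toy usage with a random dataset and a linear "current policy".
rng = np.random.default_rng(0)
data = {
    "observations": rng.normal(size=(1000, 4)),
    "actions": rng.normal(size=(1000, 2)),
    "rewards": rng.normal(size=(1000,)),
}
W = rng.normal(size=(4, 2))
subset = select_curriculum_batch(data, lambda obs: obs @ W, log_std=np.zeros(2))
print({k: v.shape for k, v in subset.items()})
```

Under this sketch, the retained subset shifts as the policy improves, mirroring the staged use of samples that the abstract describes.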

References

[1]
Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. 2019. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep (2019), 10--4.
[2]
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. 2020. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning. PMLR, 104--114.
[3]
David Brandfonbrener, Will Whitney, Rajesh Ranganath, and Joan Bruna. 2021. Offline RL without off-policy evaluation. Advances in Neural Information Processing Systems, Vol. 34 (2021), 4933--4946.
[4]
Jacob Buckman, Carles Gelada, and Marc G Bellemare. 2020. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799 (2020).
[5]
Rasheed El-Bouri, David Eyre, Peter Watkinson, Tingting Zhu, and David Clifton. 2020. Student-teacher curriculum learning via reinforcement learning: predicting hospital inpatient admission location. In International Conference on Machine Learning. PMLR, 2848--2857.
[6]
Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. 2018. Automatic goal generation for reinforcement learning agents. In International conference on machine learning. PMLR, 1515--1528.
[7]
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. 2020. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020).
[8]
Yuwei Fu, Di Wu, and Benoit Boulet. 2021. Benchmarking Sample Selection Strategies for Batch Reinforcement Learning. (2021).
[9]
Scott Fujimoto and Shixiang Shane Gu. 2021. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, Vol. 34 (2021), 20132--20145.
[10]
Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning. PMLR, 2052--2062.
[11]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 1861--1870.
[12]
Ben Hambly, Renyuan Xu, and Huining Yang. 2021. Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553 (2021).
[13]
Allan Jabri, Kyle Hsu, Abhishek Gupta, Ben Eysenbach, Sergey Levine, and Chelsea Finn. 2019. Unsupervised curricula for visual meta-reinforcement learning. Advances in Neural Information Processing Systems, Vol. 32 (2019).
[14]
Nan Jiang and Lihong Li. 2016. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning. PMLR, 652--661.
[15]
Ying Jin, Zhuoran Yang, and Zhaoran Wang. 2021. Is Pessimism Provably Efficient for Offline RL?. In International Conference on Machine Learning. PMLR, 5084--5096.
[16]
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. 2020. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951 (2020).
[17]
Tae-Hoon Kim and Jonghyun Choi. 2018. Screenernet: Learning self-paced curriculum for deep neural networks. arXiv preprint arXiv:1801.00904 (2018).
[18]
B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. 2021. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems (2021).
[19]
Pascal Klink, Carlo D'Eramo, Jan Peters, and Joni Pajarinen. 2021. Boosted Curriculum Reinforcement Learning. In International Conference on Learning Representations.
[20]
Jens Kober, J Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, Vol. 32, 11 (2013), 1238--1274.
[21]
Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. 2021a. Offline reinforcement learning with Fisher divergence critic regularization. In International Conference on Machine Learning. PMLR, 5774--5783.
[22]
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2021b. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169 (2021).
[23]
Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. 2019. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949 (2019).
[24]
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779 (2020).
[25]
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020).
[26]
Guogang Liao, Ze Wang, Xiaoxu Wu, Xiaowen Shi, Chuheng Zhang, Yongkang Wang, Xingxing Wang, and Dong Wang. 2022. Cross DQN: Cross deep Q network for ads allocation in feed. In Proceedings of the ACM Web Conference 2022. 401--409.
[27]
Minghuan Liu, Hanye Zhao, Zhengyu Yang, Jian Shen, Weinan Zhang, Li Zhao, and Tie-Yan Liu. 2021. Curriculum offline imitating learning. Advances in Neural Information Processing Systems, Vol. 34 (2021), 6266--6277.
[28]
Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. 2020. Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202 (2020).
[29]
Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. 2019. Teacher--student curriculum learning. IEEE transactions on neural networks and learning systems, Vol. 31, 9 (2019), 3732--3740.
[30]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, Vol. 518, 7540 (2015), 529--533.
[31]
Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. 2019a. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, Vol. 32 (2019).
[32]
Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. 2019b. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074 (2019).
[33]
Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. 2020. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359 (2020).
[34]
Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. 2020. Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. Journal of Machine Learning Research, Vol. 21, 181 (2020), 1--50.
[35]
Sanmit Narvekar, Jivko Sinapov, Matteo Leonetti, and Peter Stone. 2016. Source task creation for curriculum learning. In Proceedings of the 2016 international conference on autonomous agents & multiagent systems. 566--574.
[36]
Sanmit Narvekar, Jivko Sinapov, and Peter Stone. 2017. Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning. In IJCAI. 2536--2542.
[37]
Sanmit Narvekar and Peter Stone. 2019. Learning Curriculum Policies for Reinforcement Learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. 25--33.
[38]
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. 2019. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177 (2019).
[39]
Rémy Portelas, Cédric Colas, Katja Hofmann, and Pierre-Yves Oudeyer. 2020. Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments. In Conference on Robot Learning. PMLR, 835--853.
[40]
Doina Precup. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series (2000), 80.
[41]
Sebastien Racaniere, Andrew K Lampinen, Adam Santoro, David P Reichert, Vlad Firoiu, and Timothy P Lillicrap. 2019. Automated curricula through setter-solver interactions. arXiv preprint arXiv:1909.12892 (2019).
[42]
Zhipeng Ren, Daoyi Dong, Huaxiong Li, and Chunlin Chen. 2018. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE transactions on neural networks and learning systems, Vol. 29, 6 (2018), 2216--2226.
[43]
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized Experience Replay. In ICLR (Poster).
[44]
Wenjie Shi, Tianchi Cai, Shiji Song, Lihong Gu, Jinjie Gu, and Gao Huang. 2020. Robust Offline Reinforcement Learning from Low-Quality Data. (2020).
[45]
Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. 2020. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396 (2020).
[46]
Felipe Leno Da Silva and Anna Helena Reali Costa. 2018. Object-oriented curriculum generation for reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 1026--1034.
[47]
Bharat Singh, Rajesh Kumar, and Vinay Pratap Singh. 2021. Reinforcement learning in robotic applications: a comprehensive survey. Artificial Intelligence Review (2021), 1--46.
[48]
Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. 2017. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407 (2017).
[49]
Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
[50]
Maxwell Svetlik, Matteo Leonetti, Jivko Sinapov, Rishi Shah, Nick Walker, and Peter Stone. 2017. Automatic curriculum graph generation for reinforcement learning agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[51]
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. 2020. Critic regularized regression. Advances in Neural Information Processing Systems, Vol. 33 (2020), 7768--7778.
[52]
Yifan Wu, George Tucker, and Ofir Nachum. 2019. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361 (2019).
[53]
Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, and Hanlin Goh. 2021. Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning. arXiv preprint arXiv:2105.08140 (2021).
[54]
Sijia Xu, Hongyu Kuang, Zhuang Zhi, Renjie Hu, Yang Liu, and Huyang Sun. 2019. Macro action selection with deep reinforcement learning in StarCraft. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 15. 94--99.
[55]
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. 2020. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239 (2020).
[56]
Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. 2020. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072 (2020).

Published In

AAMAS '23: Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems
May 2023
3131 pages
ISBN:9781450394321
  • General Chairs: Noa Agmon, Bo An
  • Program Chairs: Alessandro Ricci, William Yeoh

Publisher

International Foundation for Autonomous Agents and Multiagent Systems

Richland, SC

Publication History

Published: 30 May 2023

Author Tags

  1. curriculum learning
  2. mixed dataset
  3. offline reinforcement learning

Qualifiers

  • Research-article

Conference

AAMAS '23

Acceptance Rates

Overall Acceptance Rate 1,155 of 5,036 submissions, 23%
