DOI: 10.1145/3523227.3546758
research-article
Open access

Off-Policy Actor-critic for Recommender Systems

Published: 13 September 2022

Abstract

Industrial recommendation platforms are increasingly concerned with making recommendations that improve users' long-term experience on the platform. Reinforcement learning emerges naturally as an appealing approach for its promise in 1) combating the feedback-loop effect that results from myopic system behaviors; and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and content items, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant challenges in learning due to distribution shift.
A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised-learning production system. Off-policy correction was employed to learn from logged data. The algorithm partially mitigates the distribution shift by employing a one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the learned target policy through temporal-difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience.
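To make the learning setup described above concrete, the sketch below writes out the one-step importance-weighted REINFORCE gradient of [3] and the actor-critic variant the abstract refers to. The notation (behavior policy \beta, critic Q_\psi, discount \gamma, target critic Q_{\bar\psi}) is ours, chosen for illustration; it is a plausible reading of the abstract, not the paper's exact formulation.

    % One-step importance-weighted REINFORCE gradient from [3]; logged actions a_t
    % were drawn from the behavior (logging) policy \beta:
    \nabla_\theta J(\pi_\theta) \;\approx\;
        \sum_t \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}\,
        R_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)

    % Actor-critic variant sketched here: a critic Q_\psi replaces the empirical
    % return R_t in the actor gradient and is itself fit by temporal-difference
    % learning on logged transitions (s_t, a_t, r_t, s_{t+1}):
    \min_\psi\; \Big( Q_\psi(s_t, a_t) - r_t - \gamma\, Q_{\bar\psi}(s_{t+1}, a') \Big)^2,
        \qquad a' \sim \pi_\theta(\cdot \mid s_{t+1})

Here Q_{\bar\psi} denotes a slowly updated target copy of the critic, a standard stabilization device we assume for the sketch rather than something stated in the abstract. Because a' is drawn from the learned policy \pi_\theta rather than from the logged data, the critic evaluates the target policy and thereby addresses the distribution shift beyond the one-step correction alone.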
An interesting discovery along the way is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
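A minimal numerical illustration of this pessimism phenomenon, written by the editor rather than taken from the paper: training a softmax policy by log-likelihood ascent on logged actions raises the logits of items the behavior policy exposed and, through the shared normalizer, drives the probability of never-logged (OOD) items toward zero. All item counts and learning rates below are hypothetical.

    # Hypothetical sketch: a softmax policy trained only on logged actions
    # becomes implicitly pessimistic about out-of-distribution items.
    import numpy as np

    rng = np.random.default_rng(0)
    num_items = 1000
    logits = np.zeros(num_items)
    # Assume the behavior policy only ever recommended items 0..49.
    logged_actions = rng.integers(0, 50, size=10_000)

    lr = 0.1
    for a in logged_actions:
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = -probs          # gradient of log softmax w.r.t. all logits ...
        grad[a] += 1.0         # ... plus one for the observed action
        logits += lr * grad    # log-likelihood ascent on the logged action

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(probs[:50].sum())    # mass concentrated on in-distribution items (close to 1)
    print(probs[50:].max())    # largest probability of any OOD item (vanishingly small)

The softmax parameterization alone can push OOD probabilities far lower than intended, which is one way an agent can end up over-pessimistic; as the abstract notes, the balance between pessimism and optimism on OOD actions has to be set deliberately rather than left to the parameterization.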

References

[1]
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. 2017. Constrained policy optimization. arXiv preprint arXiv:1705.10528 (2017).
[2]
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. 2019. Striving for Simplicity in Off-policy Deep Reinforcement Learning. arXiv preprint arXiv:1907.04543 (2019).
[3]
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. 2019. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 456–464.
[4]
Minmin Chen, Bo Chang, Can Xu, and Ed H Chi. 2021. User response models to improve a reinforce recommender system. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 121–129.
[5]
Minmin Chen, Ramki Gummadi, Chris Harris, and Dale Schuurmans. 2019. Surrogate Objectives for Batch Policy Optimization in One-step Decision Making. In Advances in Neural Information Processing Systems. 8825–8835.
[6]
Minmin Chen, Yuyan Wang, Can Xu, Ya Le, Mohit Sharma, Lee Richardson, Su-Lin Wu, and Ed Chi. 2021. Values of User Exploration in Recommender Systems. In Fifteenth ACM Conference on Recommender Systems. 85–95.
[7]
Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019. Generative adversarial user model for reinforcement learning based recommendation system. In International Conference on Machine Learning. PMLR, 1052–1061.
[8]
Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
[9]
Nathaniel D Daw, John P O'Doherty, Peter Dayan, Ben Seymour, and Raymond J Dolan. 2006. Cortical substrates for exploratory decisions in humans. Nature 441, 7095 (2006), 876.
[10]
Thomas Degris, Martha White, and Richard S Sutton. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839 (2012).
[11]
Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
[12]
Damien Ernst, Pierre Geurts, and Louis Wehenkel. 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, Apr (2005).
[13]
Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. In International conference on machine learning. PMLR, 1587–1596.
[14]
Scott Fujimoto, David Meger, and Doina Precup. 2018. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900 (2018).
[15]
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B testing for Recommender Systems. In WSDM. ACM, 198–206.
[16]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 1861–1870.
[17]
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. 2019. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning. 2555–2565.
[18]
Nikolaus Hansen and Andreas Ostermeier. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation 9, 2 (2001), 159–195.
[19]
Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 368–377.
[20]
Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, and Craig Boutilier. 2019. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19).
[21]
Amir H Jadidinejad, Craig Macdonald, and Iadh Ounis. 2021. The Simpson’s Paradox in the Offline Evaluation of Recommendation Systems. ACM Transactions on Information Systems (TOIS) 40, 1 (2021), 1–22.
[22]
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007 (2014).
[23]
Olivier Jeunen and Bart Goethals. 2021. Pessimistic reward models for off-policy learning in recommendation. In Fifteenth ACM Conference on Recommender Systems. 63–74.
[24]
Olivier Jeunen, David Rohde, Flavian Vasile, and Martin Bompaire. 2020. Joint policy-value learning for recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1223–1233.
[25]
Shivaram Kalyanakrishnan and Peter Stone. 2007. Batch reinforcement learning in a complex domain. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems. ACM, 94.
[26]
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. 2020. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems 33 (2020), 21810–21823.
[27]
Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly robust off-policy evaluation for ranking policies under the cascade behavior model. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 487–497.
[28]
Vijay Konda and John Tsitsiklis. 1999. Actor-critic algorithms. Advances in neural information processing systems 12 (1999).
[29]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[30]
Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. 2019. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. arXiv preprint arXiv:1906.00949 (2019).
[31]
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 1179–1191.
[32]
Sascha Lange, Thomas Gabel, and Martin Riedmiller. 2012. Batch reinforcement learning. In Reinforcement learning. Springer, 45–73.
[33]
Sergey Levine and Vladlen Koltun. 2013. Guided policy search. In International Conference on Machine Learning. 1–9.
[34]
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020).
[35]
Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. 2018. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37, 4-5 (2018), 421–436.
[36]
Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, and Yuzhou Zhang. 2018. Deep reinforcement learning based recommendation with explicit user-item interactions modeling. arXiv preprint arXiv:1810.12027 (2018).
[37]
Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. 2018. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems 31 (2018).
[38]
Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2011. Hashing with graphs. In ICML.
[39]
Rishabh Mehrotra, Niannan Xue, and Mounia Lalmas. 2020. Bandit based optimization of multiple objectives on a music streaming platform. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 3224–3233.
[40]
Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. 2019. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems 32 (2019).
[41]
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. 2019. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177 (2019).
[42]
Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548 (2017).
[43]
Noveen Sachdeva, Yi Su, and Thorsten Joachims. 2020. Off-policy bandits with deficient support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 965–975.
[44]
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. 285–295.
[45]
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, 2020. Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 7839 (2020), 604–609.
[46]
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In International Conference on Machine Learning. 1889–1897.
[47]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[48]
Wenjie Shang, Yang Yu, Qingyang Li, Zhiwei Qin, Yiping Meng, and Jieping Ye. 2019. Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 566–576.
[49]
Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005).
[50]
Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. 2020. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396 (2020).
[51]
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484.
[52]
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354–359.
[53]
Dusan Stamenkovic, Alexandros Karatzoglou, Ioannis Arapakis, Xin Xin, and Kleomenis Katevas. 2022. Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 957–965.
[54]
Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. 2010. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems. 2217–2225.
[55]
Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. MIT Press.
[56]
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems 12 (1999).
[57]
Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research 16, 1 (2015), 1731–1755.
[58]
Adith Swaminathan and Thorsten Joachims. 2015. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems. 3231–3239.
[59]
Philip Thomas and Emma Brunskill. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. In ICML. 2139–2148.
[60]
Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30.
[61]
Kristof Van Moffaert, Madalina M Drugan, and Ann Nowé. 2013. Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 191–199.
[62]
Kristof Van Moffaert and Ann Nowé. 2014. Multi-objective reinforcement learning using sets of pareto dominating policies. The Journal of Machine Learning Research 15, 1 (2014), 3483–3512.
[63]
Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
[64]
Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
[65]
Yifan Wu, George Tucker, and Ofir Nachum. 2019. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361 (2019).
[66]
Teng Xiao and Donglin Wang. 2021. A general offline reinforcement learning framework for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4512–4520.
[67]
Teng Xiao and Suhang Wang. 2022. Towards off-policy learning for ranking policies with logged feedback. In The Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI.
[68]
Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M Jose. 2020. Self-supervised reinforcement learning for recommender systems. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 931–940.
[69]
Haonan Yu, Haichao Zhang, and Wei Xu. 2022. Do You Need the Entropy Reward (in Practice)? arXiv preprint arXiv:2201.12434 (2022).
[70]
Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys 52, 1 (2019), 38 pages.
[71]
Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2019. Deep reinforcement learning for search, recommendation, and online advertising: A survey. ACM SIGWEB Newsletter Spring (2019), 1–15.
[72]
Xiangyu Zhao, Long Xia, Dawei Yin, and Jiliang Tang. 2019. Model-based reinforcement learning for whole-chain recommendations. arXiv preprint arXiv:1902.03987 (2019).
[73]
Xiangyu Zhao, Liang Zhang, Long Xia, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2017. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2017).
[74]
Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW '18). 167–176.
[75]
Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, and Dawei Yin. 2019. Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2810–2818.


Information & Contributors

Information

Published In

RecSys '22: Proceedings of the 16th ACM Conference on Recommender Systems
September 2022
743 pages
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2022


Author Tags

  1. batch RL
  2. off-policy actor-critic
  3. pessimism
  4. recommender systems
  5. reinforcement learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Acceptance Rates

Overall Acceptance Rate 254 of 1,295 submissions, 20%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1,125
  • Downloads (Last 6 weeks): 113
Reflects downloads up to 24 Sep 2024


Citations

Cited By

  • (2024) On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender Systems. ACM Transactions on Information Systems 42, 6 (2024), 1–26. DOI: 10.1145/3661996. Online publication date: 19-Aug-2024.
  • (2024) M3Rec: A Context-Aware Offline Meta-Level Model-Based Reinforcement Learning Approach for Cold-Start Recommendation. ACM Transactions on Information Systems 42, 6 (2024), 1–27. DOI: 10.1145/3659947. Online publication date: 19-Aug-2024.
  • (2024) Meta Clustering of Neural Bandits. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 95–106. DOI: 10.1145/3637528.3671691. Online publication date: 25-Aug-2024.
  • (2024) EasyRL4Rec: An Easy-to-use Library for Reinforcement Learning Based Recommender Systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 977–987. DOI: 10.1145/3626772.3657868. Online publication date: 10-Jul-2024.
  • (2024) User Response Modeling in Reinforcement Learning for Ads Allocation. In Companion Proceedings of the ACM Web Conference 2024. 131–140. DOI: 10.1145/3589335.3648310. Online publication date: 13-May-2024.
  • (2024) Neural Contextual Bandits for Personalized Recommendation. In Companion Proceedings of the ACM Web Conference 2024. 1246–1249. DOI: 10.1145/3589335.3641241. Online publication date: 13-May-2024.
  • (2024) Retention Depolarization in Recommender System. In Proceedings of the ACM Web Conference 2024. 1126–1137. DOI: 10.1145/3589334.3645485. Online publication date: 13-May-2024.
  • (2024) Enhanced entropy based reinforcement learning hotel recommendation system. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18732-9. Online publication date: 15-Mar-2024.
  • (2023) Deep Exploration for Recommendation Systems. In Proceedings of the 17th ACM Conference on Recommender Systems. 963–970. DOI: 10.1145/3604915.3608855. Online publication date: 14-Sep-2023.
  • (2023) Optimizing Long-term Value for Auction-Based Recommender Systems via On-Policy Reinforcement Learning. In Proceedings of the 17th ACM Conference on Recommender Systems. 955–962. DOI: 10.1145/3604915.3608854. Online publication date: 14-Sep-2023.
