ABSTRACT
Industrial recommender systems deal with extremely large action spaces -- many millions of items to recommend. Moreover, they need to serve billions of users, each of whom is unique at any point in time, resulting in a complex user state space. Luckily, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from logged feedback is, however, subject to biases caused by only observing feedback on recommendations selected by previous versions of the recommender. In this work, we present a general recipe for addressing such biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, namely REINFORCE. The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases when learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on YouTube.
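The core estimator behind contributions (2) and (3) is easy to state. The off-policy correction weights each logged gradient term by the importance ratio pi_theta(a|s) / beta(a|s) between the policy being learned and the behavior policy that produced the logs; the paper's top-K correction additionally multiplies each term by lambda_K(s, a) = K * (1 - pi_theta(a|s))^(K-1) to account for the item appearing anywhere in a slate of K recommendations. Below is a minimal numpy sketch of the resulting gradient estimate, under assumptions of our own: a log-linear softmax policy over a small catalog, and a weight cap w_cap for variance control (a common practice, not a detail taken from this abstract). All function and parameter names are illustrative.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logits vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def topk_reinforce_grad(theta, states, actions, rewards, beta_probs, K, w_cap=10.0):
    """Top-K off-policy-corrected REINFORCE gradient estimate for a
    log-linear policy pi_theta(a|s) = softmax(theta @ s)[a].

    theta:      (num_items, state_dim) policy parameters
    states:     (T, state_dim) logged user-state features
    actions:    (T,) logged item ids chosen by the behavior policy
    rewards:    (T,) observed (cumulative discounted) rewards
    beta_probs: (T,) behavior policy's probability of each logged action
    K:          slate size of the top-K recommender
    """
    grad = np.zeros_like(theta)
    for s, a, r, beta in zip(states, actions, rewards, beta_probs):
        pi = softmax(theta @ s)                 # pi_theta(.|s) over all items
        w = min(pi[a] / beta, w_cap)            # capped importance weight
        lam = K * (1.0 - pi[a]) ** (K - 1)      # top-K correction multiplier
        # Gradient of log pi(a|s) for a log-linear policy:
        # (one_hot(a) - pi) outer-product s.
        g_logp = -np.outer(pi, s)
        g_logp[a] += s
        grad += w * lam * r * g_logp
    return grad / len(actions)
```

Note that with K = 1 the multiplier lambda_K reduces to 1 and the estimator falls back to standard off-policy-corrected REINFORCE, which is exactly the behavior one would want from a slate-size generalization.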
Recommendations
Neural Variational Collaborative Filtering for Top-K Recommendation
Trends and Applications in Knowledge Discovery and Data Mining. Collaborative Filtering (CF) is one of the most widely applied models for recommender systems. However, CF-based methods suffer from data sparsity and cold-start, so more attention has been drawn to hybrid methods that use both the rating and content ...

A Scalable, Accurate Hybrid Recommender System
WKDD '10: Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining. Recommender systems apply machine learning techniques for filtering unseen information and can predict whether a user would like a given resource. There are three main types of recommender systems: collaborative filtering, content-based filtering, and ...

Practical Counterfactual Policy Learning for Top-K Recommendations
KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. For building recommender systems, a critical task is to learn a policy from collected feedback (e.g., ratings, clicks) to decide which items to recommend to users. However, it has been shown that the selection bias in the collected feedback leads ...