DOI: 10.1145/3289600.3290999 · ACM WSDM Conference Proceedings · Research article

Top-K Off-Policy Correction for a REINFORCE Recommender System

Published: 30 January 2019

ABSTRACT

Industrial recommender systems deal with extremely large action spaces -- many millions of items to recommend. Moreover, they need to serve billions of users, each unique at any point in time, making for a complex user state space. Luckily, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from this logged feedback is, however, subject to biases caused by only observing feedback on recommendations selected by previous versions of the recommender. In this work, we present a general recipe for addressing such biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, i.e., REINFORCE. The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases when learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on YouTube.
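To make contributions (2) and (3) concrete, the following is a minimal sketch (not the authors' production code) of an off-policy-corrected REINFORCE gradient for a softmax policy over items, including the top-K multiplier K(1 − π_θ(a|s))^(K−1) derived in the paper. The function name, toy batch, and the assumption that logging-policy probabilities β(a|s) are directly available are illustrative.

    # Hypothetical sketch: REINFORCE with an importance-sampling off-policy
    # correction and the paper's top-K multiplier. Names, shapes, and the toy
    # data below are illustrative assumptions.
    import numpy as np

    def topk_off_policy_gradient(pi_logits, beta_probs, actions, rewards, K):
        """Per-example policy gradient w.r.t. the logits of a softmax policy.

        pi_logits : (N, A) logits of the learned policy pi_theta over A items.
        beta_probs: (N,)   probability the logging (behavior) policy assigned
                           to each logged action; assumed known or estimated.
        actions   : (N,)   indices of the logged actions.
        rewards   : (N,)   observed rewards (e.g., clicks) for those actions.
        K         : size of the recommended slate.
        """
        N, A = pi_logits.shape
        # Softmax probabilities of the learned policy.
        pi = np.exp(pi_logits - pi_logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        pi_a = pi[np.arange(N), actions]

        # Standard off-policy importance weight pi_theta(a|s) / beta(a|s).
        w = pi_a / np.clip(beta_probs, 1e-6, None)

        # Top-K correction: lambda_K(s, a) = K * (1 - pi_theta(a|s))^(K - 1),
        # accounting for the item appearing anywhere in a size-K slate.
        lam = K * (1.0 - pi_a) ** (K - 1)

        # grad of log pi_theta(a|s) w.r.t. the logits for a softmax policy:
        # one_hot(a) - pi.
        grad_logp = -pi
        grad_logp[np.arange(N), actions] += 1.0

        # Per-example REINFORCE gradient contributions with off-policy and
        # top-K corrections; in practice these are backpropagated through the
        # network producing the logits and averaged over the batch.
        return (w * lam * rewards)[:, None] * grad_logp / N

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        N, A, K = 4, 10, 3                    # toy batch: 4 events, 10 items, slate of 3
        logits = rng.normal(size=(N, A))
        actions = rng.integers(0, A, size=N)
        beta = rng.uniform(0.05, 0.3, size=N) # logging-policy probabilities (assumed logged)
        rewards = rng.integers(0, 2, size=N).astype(float)
        g = topk_off_policy_gradient(logits, beta, actions, rewards, K)
        print(g.shape)                        # (4, 10) gradient w.r.t. the logits

In the full paper, the behavior policy is estimated from the logged data rather than assumed given, and π_θ is parameterized by a recurrent network over the user's interaction history; the sketch abstracts both away.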
