
Off-policy Learning in Two-stage Recommender Systems

Published: 20 April 2020

Abstract

Many real-world recommender systems need to be highly scalable: matching millions of items to billions of users with millisecond latency. This scalability requirement has led to the widespread use of two-stage recommender systems, which consist of one or more efficient candidate generation models in the first stage and a more powerful ranking model in the second stage.
Logged user feedback, e.g., user clicks or dwell time, is often used to build both the candidate generation and ranking models of recommender systems. While it is easy to collect large amounts of such data, the data are inherently biased because feedback can only be observed on items recommended by the previous systems. Recently, off-policy correction of such biases has attracted increasing interest in recommender system research. However, most existing work either assumes that the recommender system is a single-stage system or studies only how to apply off-policy correction to the candidate generation stage, without explicitly considering the interactions between the two stages.
In this work, we propose a two-stage off-policy policy gradient method, and show that ignoring the interaction between the two stages leads to a sub-optimal policy in two-stage recommender systems. The proposed method explicitly takes the ranking model into account when training the candidate generation model, which improves the performance of the whole system. We conduct experiments on real-world datasets with a large item space and demonstrate the effectiveness of the proposed method.
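To make the core idea concrete, the following is a minimal sketch, not the authors' implementation: an off-policy REINFORCE update for a softmax candidate-generation policy in which the importance weight is computed from the composed two-stage propensity (stage-1 probability re-weighted by a fixed, simplified ranker) rather than from the stage-1 probability alone. All item counts, reward values, ranker scores, and the uniform logging policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 5

# Stage 1: softmax candidate-generation policy (parameters to learn).
theta = np.zeros(n_items)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Stage 2: a ranker, simplified here to fixed, context-free selection
# preferences (a real ranker would condition on the candidate set).
rank_scores = np.array([0.1, 0.5, 0.2, 1.0, 0.3])

# Logging policy beta (what the deployed system did): uniform over items.
beta = np.full(n_items, 1.0 / n_items)

# Hypothetical logged bandit feedback; item 3 has the highest true reward.
true_reward = np.array([0.1, 0.2, 0.3, 0.9, 0.4])
logged_items = rng.integers(n_items, size=200)

lr = 0.5
for item in logged_items:
    reward = true_reward[item]
    pi = softmax(theta)
    # Two-stage propensity: stage-1 probability re-weighted by the
    # ranker's preference, then renormalized.
    pi_two_stage = pi * rank_scores
    pi_two_stage /= pi_two_stage.sum()
    w = pi_two_stage[item] / beta[item]       # importance weight
    grad_log_pi = -pi                          # gradient of log softmax ...
    grad_log_pi[item] += 1.0                   # ... w.r.t. theta
    theta += lr * w * reward * grad_log_pi     # off-policy REINFORCE step
```

Because the importance weight uses the composed two-stage probability instead of the stage-1 probability alone, the candidate generator is credited only in proportion to the chance that the ranker would actually surface the item, which is the two-stage interaction the paper argues must not be ignored.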


Cited By

  • (2024) On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender Systems. ACM Transactions on Information Systems 42, 6, 1–26. DOI: 10.1145/3661996
  • (2024) M3Rec: A Context-Aware Offline Meta-Level Model-Based Reinforcement Learning Approach for Cold-Start Recommendation. ACM Transactions on Information Systems 42, 6, 1–27. DOI: 10.1145/3659947
  • (2024) Counterfactual Explanation for Fairness in Recommendation. ACM Transactions on Information Systems 42, 4, 1–30. DOI: 10.1145/3643670

        Published In

        WWW '20: Proceedings of The Web Conference 2020
        April 2020
        3143 pages
        ISBN:9781450370233
        DOI:10.1145/3366423

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Neural Networks
        2. Off-policy Learning
        3. Recommender Systems
        4. Two-stage Systems

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

WWW '20: The Web Conference 2020
April 20–24, 2020
Taipei, Taiwan

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

