Research article · Public Access
DOI: 10.1145/3394486.3403139

Off-policy Bandits with Deficient Support

Published: 20 August 2020

Abstract

Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g., voice assistants, recommendation, search), since it enables the reuse of large amounts of log data. State-of-the-art methods for such off-policy learning, however, are based on inverse propensity score (IPS) weighting. A key theoretical requirement of IPS weighting is that the policy that logged the data has "full support", which typically translates into requiring non-zero probability for any action in any context. Unfortunately, many real-world systems produce support-deficient data, especially when the action space is large, and we show how existing methods can fail catastrophically. To overcome this gap between theory and applications, we identify three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We systematically analyze the statistical and computational properties of these three approaches, and we empirically evaluate their effectiveness. In addition to providing the first systematic analysis of support deficiency in contextual-bandit learning, we conclude with recommendations that provide practical guidance.
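
To make the support requirement concrete, the sketch below (our illustration, not code from the paper) shows the vanilla IPS estimator failing on support-deficient logs, followed by the first of the three remedies named above, restricting the action space. Context is omitted for brevity (a single multi-armed problem stands in for the contextual case), and the logging policy, target policy, and reward values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_logged = 5, 100_000

# Logging policy mu with deficient support: actions 3 and 4 have zero probability.
mu = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
# Toy expected reward of each action (unknown to the learner).
true_reward = np.array([0.2, 0.4, 0.6, 0.8, 1.0])

# Logged bandit feedback: actions drawn from mu, Bernoulli rewards.
actions = rng.choice(n_actions, size=n_logged, p=mu)
rewards = rng.binomial(1, true_reward[actions])

# Target policy pi puts most of its mass on an unsupported action.
pi = np.array([0.05, 0.05, 0.10, 0.10, 0.70])
print(f"true value of pi: {pi @ true_reward:.3f}")  # 0.870

# Vanilla IPS: average of pi(a)/mu(a) * r over the log. Unsupported actions
# never appear in the log, so their reward is silently dropped.
ips = np.mean(pi[actions] / mu[actions] * rewards)
print(f"vanilla IPS:      {ips:.3f}")  # ~0.090, catastrophically biased low

# Remedy sketched here: restrict the action space to the support of mu and
# renormalize pi. IPS is then unbiased, but only for this restricted policy.
support = mu > 0
pi_r = np.where(support, pi, 0.0)
pi_r = pi_r / pi_r.sum()
ips_r = np.mean(pi_r[actions] / mu[actions] * rewards)
print(f"true value of restricted pi: {pi_r @ true_reward:.3f}")  # 0.450
print(f"IPS on restricted pi:        {ips_r:.3f}")               # ~0.450, unbiased
```

The vanilla estimate silently loses all value from actions the logging policy never plays, while the restricted estimate is unbiased only for the renormalized policy over supported actions; the trade-offs among this and the other two remedies (reward extrapolation, restricting the policy space) are what the paper analyzes.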




Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. contextual bandits
  2. counterfactual reasoning
  3. implicit feedback
  4. log data
  5. off-policy learning


Funding Sources

  • NSF

Conference

KDD '20

Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%


Cited By

  • (2024) Policy Learning for Off-Dynamics RL with Deficient Support. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 1093-1100. DOI: 10.5555/3635637.3662965 (online 6 May 2024)
  • (2024) Causal Inference in Recommender Systems: A Survey and Future Directions. ACM Transactions on Information Systems 42(4), 1-32. DOI: 10.1145/3639048 (online 2 Jan 2024)
  • (2024) On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1222-1233. DOI: 10.1145/3637528.3671687 (online 25 Aug 2024)
  • (2024) Off-Policy Evaluation for Large Action Spaces via Policy Convolution. Proceedings of the ACM Web Conference 2024, 3576-3585. DOI: 10.1145/3589334.3645501 (online 13 May 2024)
  • (2024) Long-term Off-Policy Evaluation and Learning. Proceedings of the ACM Web Conference 2024, 3432-3443. DOI: 10.1145/3589334.3645446 (online 13 May 2024)
  • (2024) Learning Action Embeddings for Off-Policy Evaluation. Advances in Information Retrieval, 108-122. DOI: 10.1007/978-3-031-56027-9_7 (online 24 Mar 2024)
  • (2023) Marginal density ratio for off-policy evaluation in contextual bandits. Proceedings of the 37th International Conference on Neural Information Processing Systems, 52648-52691. DOI: 10.5555/3666122.3668415 (online 10 Dec 2023)
  • (2023) Counterfactual-augmented importance sampling for semi-offline policy evaluation. Proceedings of the 37th International Conference on Neural Information Processing Systems, 11394-11429. DOI: 10.5555/3666122.3666626 (online 10 Dec 2023)
  • (2023) Budgeting counterfactual for offline RL. Proceedings of the 37th International Conference on Neural Information Processing Systems, 5729-5751. DOI: 10.5555/3666122.3666372 (online 10 Dec 2023)
  • (2023) Sequential counterfactual risk minimization. Proceedings of the 40th International Conference on Machine Learning, 40681-40706. DOI: 10.5555/3618408.3620113 (online 23 Jul 2023)
