DOI: 10.1145/3543507.3583448
Research article
Open access

Offline Policy Evaluation in Large Action Spaces via Outcome-Oriented Action Grouping

Published: 30 April 2023

Abstract

Offline policy evaluation (OPE) aims to accurately estimate the performance of a hypothetical policy using only historical data, and it has drawn increasing attention in a wide range of applications, including recommender systems and personalized medicine. As consumer data grow ever more fine-grained, many industries have begun exploring larger candidate action spaces to support more precise personalization. While inverse propensity scoring (IPS) is a standard OPE estimator, it suffers from increasingly severe variance issues as the action space grows. To address this issue, we theoretically prove that merging actions into groups reduces the estimation variance, whereas differences among the grouped actions' effects on the outcome can induce extra bias. Motivated by these results, we propose a novel IPS estimator with outcome-oriented action Grouping (GroupIPS), which leverages a Lipschitz-regularized network to measure the distance between action effects in an embedding space and merges nearest action neighbors. This strategy enables more robust estimation by achieving smaller variance while inducing only minor additional bias. Empirically, extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed method.
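
The abstract describes the estimators only at a high level. As a point of reference, the standard IPS estimate averages the importance-weighted rewards r_i * pi_e(a_i|x_i) / pi_0(a_i|x_i) over the logged data, and a grouped variant applies the same weighting at the level of action groups. The sketch below is a hypothetical illustration of that contrast, not the authors' implementation: it assumes tabular policy probabilities pi_e and pi_0 of shape (n, n_actions) and a precomputed action-to-group map group_of, whereas GroupIPS learns the grouping from action-effect distances in a Lipschitz-regularized embedding network.

```python
import numpy as np

def ips_estimate(reward, pi_e, pi_0, actions):
    """Vanilla IPS: weight each logged reward by pi_e(a|x) / pi_0(a|x)."""
    n = len(reward)
    w = pi_e[np.arange(n), actions] / pi_0[np.arange(n), actions]
    return np.mean(w * reward)

def group_ips_estimate(reward, pi_e, pi_0, actions, group_of):
    """Grouped IPS: weight rewards by group-level propensity ratios.

    Group probabilities are obtained by summing action probabilities within
    each group, which shrinks the range of the importance weights (lower
    variance) at the cost of bias when actions inside a group have
    different effects on the outcome.
    """
    n, n_actions = pi_e.shape
    n_groups = int(group_of.max()) + 1
    # Membership matrix mapping actions to groups: (n_actions, n_groups).
    member = np.zeros((n_actions, n_groups))
    member[np.arange(n_actions), group_of] = 1.0
    pi_e_grp, pi_0_grp = pi_e @ member, pi_0 @ member   # (n, n_groups)
    g = group_of[actions]                               # group of each logged action
    w = pi_e_grp[np.arange(n), g] / pi_0_grp[np.arange(n), g]
    return np.mean(w * reward)
```

With many actions and only a few groups, the group-level weights are typically much closer to 1 than the action-level weights, which is the variance-reduction effect the paper formalizes; the residual bias depends on how homogeneous the grouped actions' effects are.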

    Published In

    WWW '23: Proceedings of the ACM Web Conference 2023
    April 2023
    4293 pages
    ISBN: 9781450394161
    DOI: 10.1145/3543507
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Action Grouping
    2. Embedding Space
    3. Offline Policy Evaluation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '23: The ACM Web Conference 2023
    April 30 - May 4, 2023
    Austin, TX, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 514
    • Downloads (Last 6 weeks): 68
    Reflects downloads up to 24 Sep 2024.

    Cited By

    • (2024) Off-Policy Evaluation for Large Action Spaces via Policy Convolution. Proceedings of the ACM Web Conference 2024, 3576-3585. DOI: 10.1145/3589334.3645501. Online publication date: 13-May-2024.
    • (2024) Long-term Off-Policy Evaluation and Learning. Proceedings of the ACM Web Conference 2024, 3432-3443. DOI: 10.1145/3589334.3645446. Online publication date: 13-May-2024.
    • (2024) IDoser. Expert Systems with Applications: An International Journal, 238:PB. DOI: 10.1016/j.eswa.2023.121796. Online publication date: 27-Feb-2024.
    • (2024) Learning Action Embeddings for Off-Policy Evaluation. Advances in Information Retrieval, 108-122. DOI: 10.1007/978-3-031-56027-9_7. Online publication date: 24-Mar-2024.
    • (2023) Off-policy evaluation for large action spaces via conjunct effect modeling. Proceedings of the 40th International Conference on Machine Learning, 29734-29759. DOI: 10.5555/3618408.3619642. Online publication date: 23-Jul-2023.
