DOI: 10.1145/3209978.3210050

Offline Comparative Evaluation with Incremental, Minimally-Invasive Online Feedback

Published: 27 June 2018

Abstract

We investigate the use of logged user interaction data---queries and clicks---for offline evaluation of new search systems in the context of counterfactual analysis. The challenge of evaluating a new ranker against log data collected from a static production ranker is that the new ranker may retrieve documents that have never been seen in the logs before, and that thus lack any logged feedback from users. Additionally, the production ranker itself could bias user actions, so that even documents that have been seen in the logs would have exhibited different interaction patterns had they been retrieved and ranked by the new ranker. We present a methodology for incrementally logging interactions on previously-unseen documents for use in computing an unbiased estimator of a new ranker's effectiveness. Our method is only lightly invasive with respect to the production ranker's results, to guard against users becoming dissatisfied if the new ranker is poor. We demonstrate that our methods work well in a simulation environment designed to be challenging for such methods, arguing that they are therefore likely to work in a wide variety of scenarios.
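The unbiased estimator referenced here is, per the paper's "ips estimate" author tag, an inverse propensity scoring (IPS) estimate. The following is a minimal sketch of that general technique, not the paper's exact estimator: the log record format, the propensity model, and the 1/rank discount below are all assumptions made for illustration. The idea is to reweight each logged click by the inverse of the probability that the production ranker exposed the document, so that feedback gathered under one policy yields an unbiased estimate of another ranking's utility.

# Hedged sketch of an IPS estimate for offline ranker evaluation.
# All names (LoggedImpression, rank_weight, ips_estimate) are
# hypothetical; only the reweighting idea is taken from the paper's tags.

from dataclasses import dataclass

@dataclass
class LoggedImpression:          # hypothetical log record
    doc_id: str
    clicked: bool                # feedback observed under the production ranker
    propensity: float            # P(production policy exposed this document)

def rank_weight(rank: int) -> float:
    # Simple 1/rank discount; the paper's metric may weight ranks differently.
    return 1.0 / rank

def ips_estimate(log, new_ranking):
    # Reweight each logged click by 1/propensity so that feedback gathered
    # under the production ranker gives an unbiased estimate of the
    # rank-discounted click utility the new ranking would have earned.
    position = {doc: r + 1 for r, doc in enumerate(new_ranking)}
    total = 0.0
    for imp in log:
        if imp.clicked and imp.doc_id in position:
            total += rank_weight(position[imp.doc_id]) / imp.propensity
    return total

log = [
    LoggedImpression("d1", clicked=True, propensity=0.9),
    LoggedImpression("d2", clicked=False, propensity=0.5),
    LoggedImpression("d3", clicked=True, propensity=0.2),
]
# d3 was rarely exposed by the production ranker, so its click is
# upweighted when the new ranker promotes it to the top position.
print(ips_estimate(log, ["d3", "d1", "d2"]))   # (1/1)/0.2 + (1/2)/0.9 ~= 5.56

Note the gap this sketch cannot cover: documents the production ranker never exposed have no logged propensity at all, which is exactly the case the paper's incremental, minimally-invasive logging is designed to address.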




    Published In

    SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
    June 2018
    1509 pages
    ISBN:9781450356572
    DOI:10.1145/3209978


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. counterfactual evaluation
    2. experimentation
    3. ips estimate
    4. measurement
    5. performance

    Qualifiers

    • Research-article

    Conference

    SIGIR '18

    Acceptance Rates

    SIGIR '18 Paper Acceptance Rate 86 of 409 submissions, 21%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2024) Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality, 16(1), 1-33. https://doi.org/10.1145/3623640
    • (2023) Unbiased Top-$k$ Learning to Rank with Causal Likelihood Decomposition. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 129-138. https://doi.org/10.1145/3624918.3625340
    • (2020) Accelerated Convergence for Counterfactual Learning to Rank. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 469-478. https://doi.org/10.1145/3397271.3401069
    • (2020) Unbiased Learning to Rank: Counterfactual and Online Approaches. Companion Proceedings of the Web Conference 2020, 299-300. https://doi.org/10.1145/3366424.3383107
    • (2019) Addressing Trust Bias for Unbiased Learning-to-Rank. The World Wide Web Conference, 4-14. https://doi.org/10.1145/3308558.3313697
    • (2019) Estimating Position Bias without Intrusive Interventions. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 474-482. https://doi.org/10.1145/3289600.3291017
    • (2019) Position Bias Estimation for Unbiased Learning-to-Rank in eCommerce Search. String Processing and Information Retrieval, 47-64. https://doi.org/10.1007/978-3-030-32686-9_4
    • (2018) Estimating Clickthrough Bias in the Cascade Model. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 1587-1590. https://doi.org/10.1145/3269206.3269315
