
Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms

Published: 30 January 2019

Abstract

Evaluating algorithmic recommendations is an important but difficult problem. Evaluations conducted offline using data collected from user interactions with an online system often suffer from biases arising from the user interface or the recommendation engine. Online evaluation (A/B testing) can more easily address problems of bias, but, depending on the setting, it can be time-consuming and risks negatively impacting the user experience; it is also generally more difficult when access to a large user base cannot be taken for granted. A compromise based on counterfactual analysis is to present some subset of online users with recommendation results that have been randomized or otherwise manipulated, log their interactions, and then use those logs to de-bias offline evaluations on historical data. However, previous work does not offer clear conclusions on how well such methods correlate with, and are able to predict, the results of online A/B tests. Understanding this is crucial to widespread adoption of new offline evaluation techniques in recommender systems. In this work we present a comparison of offline and online evaluation results for a particular recommendation problem: recommending playlists of tracks to a user looking for music. We describe two different ways to think about de-biasing offline collections for more accurate evaluation. Our results show that, contrary to much of the previous work on this topic, properly conducted offline experiments do correlate well with A/B test results, and moreover that we can expect an offline evaluation to identify the best candidate systems for online testing with high probability.
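To make the de-biasing idea concrete, the sketch below illustrates one standard way of using randomized logs for counterfactual evaluation: inverse propensity scoring, in which each logged interaction is re-weighted by the ratio of the probability that a candidate ranker would have shown the playlist to the probability under the randomized logging system. This is an illustrative sketch only, not the method evaluated in the paper; the data fields and names (LoggedImpression, ips_estimate, snips_estimate) are assumptions made for the example.

    # Illustrative sketch (assumed names, not the paper's implementation):
    # estimate a candidate policy's expected reward from logs collected under
    # a randomized logging policy, using inverse propensity scoring (IPS).
    from dataclasses import dataclass
    from typing import Callable, List


    @dataclass
    class LoggedImpression:
        context: str          # e.g. a user or session identifier
        action: str           # the playlist that was actually shown
        reward: float         # observed outcome, e.g. 1.0 if the user streamed it
        logging_prob: float   # probability the randomized logger showed this playlist


    def ips_estimate(logs: List[LoggedImpression],
                     target_prob: Callable[[str, str], float]) -> float:
        """Mean reward the candidate policy would obtain, estimated by
        re-weighting each logged reward by target_prob / logging_prob."""
        weighted = [
            imp.reward * target_prob(imp.context, imp.action) / imp.logging_prob
            for imp in logs
        ]
        return sum(weighted) / len(logs)


    def snips_estimate(logs: List[LoggedImpression],
                       target_prob: Callable[[str, str], float]) -> float:
        """Self-normalized variant: divide by the sum of importance weights
        rather than the sample size, trading a little bias for lower variance."""
        weights = [target_prob(i.context, i.action) / i.logging_prob for i in logs]
        return sum(w * i.reward for w, i in zip(weights, logs)) / sum(weights)

Ranking candidate systems by such an estimate and sending only the most promising ones to an A/B test is the kind of decision procedure the abstract describes.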

      Published In

      WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
      January 2019
      874 pages
      ISBN: 9781450359405
      DOI: 10.1145/3289600

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 January 2019

      Author Tags

      1. counterfactual analysis
      2. experimentation
      3. music recommendation
      4. offline experimentation
      5. online experimentation
      6. personalization
      7. playlist recommendation
      8. recommendation

      Qualifiers

      • Research-article

      Conference

      WSDM '19

      Acceptance Rates

      WSDM '19 Paper Acceptance Rate: 84 of 511 submissions, 16%
      Overall Acceptance Rate: 498 of 2,863 submissions, 17%

