Research article · DOI: 10.1145/3209978.3210043 · ACM SIGIR Conference Proceedings

Stochastic Simulation of Test Collections: Evaluation Scores

ABSTRACT

Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the population of topics and, by extension, their true mean scores. The usual workaround is to resample topics from an existing collection and approximate the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample. However, this methodology is clearly limited by the availability of data, the impossibility of controlling the properties of these data, and the fact that we do not really measure what we intend to. To overcome these limitations, we propose a method based on vine copulas for the stochastic simulation of evaluation results in which the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection and builds a semi-parametric model representing the set of systems and the population of topics, which can then be used to simulate realistic scores by the same systems but on random new topics. Our ability to simulate this kind of data not only removes the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications that replicate typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.
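
To make the basic use case concrete, below is a minimal sketch of the general idea in R. It is not the authors' package: the marginals here are plain Beta distributions fitted by the method of moments (the paper uses richer semi-parametric marginals), the dependence among systems is fitted with the CRAN package VineCopula, and the score matrix, sample sizes, and variable names are placeholder assumptions for illustration only.

    # A minimal sketch, assuming a topics-by-systems matrix of effectiveness
    # scores in [0, 1]. NOT the authors' package: Beta marginals by the method
    # of moments and a regular vine copula fitted with CRAN's VineCopula.
    library(VineCopula)

    # Placeholder data: 50 topics x 5 systems (replace with real collection scores).
    set.seed(42)
    scores <- matrix(rbeta(50 * 5, 2, 5), ncol = 5)

    # 1) Fit a Beta marginal to each system by the method of moments.
    fit_beta <- function(x) {
      m <- mean(x); v <- var(x)
      k <- m * (1 - m) / v - 1
      c(shape1 = m * k, shape2 = (1 - m) * k)
    }
    marginals <- apply(scores, 2, fit_beta)

    # 2) Rank-transform to pseudo-observations and fit a regular vine copula,
    #    which captures the dependence among systems across topics.
    u <- apply(scores, 2, function(x) rank(x) / (length(x) + 1))
    vine <- RVineStructureSelect(u)

    # 3) Simulate 1000 new topics: draw from the copula and map each column
    #    back through the inverse CDF of its fitted marginal.
    u_new <- RVineSim(1000, vine)
    sim <- sapply(seq_len(ncol(scores)), function(j)
      qbeta(u_new[, j], marginals["shape1", j], marginals["shape2", j]))

    # 'sim' holds simulated scores for the same 5 systems on 1000 new topics,
    # whose true marginal distributions (the fitted Betas) are known upfront.

In data simulated this way, the true distribution of each system, and hence its true mean score, is known by construction, so statistics computed on random subsets of simulated topics can be checked against known population values rather than against another random subsample.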
