ABSTRACT
Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The usual workaround is to resample topics from an existing collection and approximate the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample from it. However, this methodology is limited by the availability of data, the impossibility of controlling the properties of these data, and the fact that we do not really measure what we intend to. To overcome these limitations, we propose a method based on vine copulas for stochastic simulation of evaluation results where the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection to build a semi-parametric model representing the set of systems and the population of topics, which can then be used to make realistic simulations of the scores by the same systems but on random new topics. Our ability to simulate this kind of data not only eliminates the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications replicating typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.
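The core idea of fitting a model to existing scores and then drawing scores for new topics can be illustrated with a much simpler stand-in. The sketch below is not the method proposed in the paper: it assumes only two systems, swaps the vine copula for a bivariate Gaussian copula, and swaps the semi-parametric margins for plain empirical quantile functions. The score lists and all function names are hypothetical. Even in this reduced form, the true mean of each simulated system is known upfront, because it equals the mean of the empirical margin it is drawn from.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the Gaussian copula

def pseudo_observations(xs):
    """Map scores to (0, 1) via rank / (n + 1); ties are broken by position."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    u = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (len(xs) + 1)
    return u

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def empirical_quantile(sorted_xs, u):
    """Inverse empirical CDF: smallest observed score whose ECDF is >= u."""
    k = math.ceil(u * len(sorted_xs)) - 1
    return sorted_xs[min(max(k, 0), len(sorted_xs) - 1)]

def fit(scores_a, scores_b):
    """Estimate the copula correlation and store the empirical margins."""
    za = [STD_NORMAL.inv_cdf(u) for u in pseudo_observations(scores_a)]
    zb = [STD_NORMAL.inv_cdf(u) for u in pseudo_observations(scores_b)]
    return {
        "rho": pearson(za, zb),
        "margin_a": sorted(scores_a),
        "margin_b": sorted(scores_b),
    }

def simulate(model, n_topics, rng):
    """Draw paired scores of both systems on n_topics simulated new topics."""
    rho = model["rho"]
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))  # guard against rounding
    out = []
    for _ in range(n_topics):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1, z2 = e1, rho * e1 + resid * e2         # correlated normals
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        out.append((empirical_quantile(model["margin_a"], u1),
                    empirical_quantile(model["margin_b"], u2)))
    return out

# Hypothetical per-topic scores of two systems on the same eight topics.
system_a = [0.12, 0.35, 0.40, 0.55, 0.61, 0.08, 0.77, 0.29]
system_b = [0.10, 0.30, 0.52, 0.49, 0.70, 0.15, 0.81, 0.33]

model = fit(system_a, system_b)
new_scores = simulate(model, 1000, random.Random(0))
true_mean_a = sum(system_a) / len(system_a)  # known by construction
```

Because the margins here are empirical, the simulated scores can only take values already observed in the original data; a smoothed or parametric margin would be needed to produce unseen values.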
Stochastic Simulation of Test Collections

Julián Urbano

