How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods

Creating test collections for offline retrieval evaluation requires human effort to judge documents' relevance. This expensive activity motivated much work in developing methods for constructing benchmarks at lower assessment cost. In this respect, adjudication methods actively decide both which documents and the order in which experts review them, in order to better exploit the assessment budget or to lower it. Researchers evaluate the quality of those methods by measuring the correlation between the known gold ranking of systems under the full collection and the observed ranking of systems under the lower-cost one. This traditional analysis ignores whether and how the low-cost judgements impact the statistically significant differences among systems with respect to the full collection. We fill this void by proposing a novel methodology to evaluate how well low-cost adjudication methods preserve the same pairwise significant differences between systems as the full collection. In other terms, while traditional approaches look for stability in answering the question "is system A better than system B?", our proposed approach looks for stability in answering the question "is system A significantly better than system B?", which is the ultimate question researchers need to answer to guarantee the generalisability of their results. Among other results, we found that the best methods in terms of ranking of systems correlation do not always match those preserving statistical significance.


INTRODUCTION
Information Retrieval (IR) is a field with a strong focus on evaluation [18,50], whose main purpose is to empirically measure the effectiveness of retrieval systems. Offline batch evaluation allows researchers to perform experiments under controlled conditions and enables the reproducibility of the results. It is based on test collections, which consist of a corpus of documents, topics, and relevance judgements (also called assessments, or qrels) [42]. Acquiring the assessments for creating these collections is costly, since human experts have to judge the documents' content and decide which ones are relevant for each topic. The advantage is that once the collections are created, it is straightforward and cheap to conduct as many experiments as needed to evaluate and compare the performance of (new) IR systems [55].
The first, small test collections had complete judgements [18], containing a human assessment for each topic-document pair, thus representing the ideal situation in terms of evaluation quality. However, that exhaustive procedure is only feasible for collections with a very small corpus. Nonetheless, small corpora are not the conditions that operational search systems face. As a consequence, when larger collections arose, there was the need to implement some kind of sampling so that assessors would not have to judge the relevance of each document for each topic. However, simple random sampling, the most immediate approach, would not work, since the number of relevant documents for a topic is extremely small compared to the size of the corpus of documents. Thus, a random sample would end up consisting of (almost all) non-relevant documents. The first solution to this problem was the pooling technique implemented by TREC [45,55]. With this technique, assessors only judge a subset of the corpus, the pool. For each topic, the pool consists of the union of the top-k documents retrieved by several search systems for that topic. The assessors judge the relevance of the documents in the pool while the rest, i.e. the non-pooled documents, are assumed to be non-relevant. Top-k pooling builds on the assumption that IR systems try to push relevant documents towards the top of the ranking and thus there is a good chance to pool most of the relevant documents for a topic, provided that k is deep enough and the pooled systems are diverse enough. However, the number of judgements that an assessor can perform, i.e. the budget, is limited and, therefore, there is a trade-off with the depth k of the pool and the number of pooled systems, since the more they grow, the higher the number of documents in the pool.
Pooling does not guarantee finding all the relevant documents for a topic but, as said, it strives to find a very good share of them. Researchers are interested in comparing systems in order to answer the fundamental question "is system A better than system B?". Answering this question requires a good estimate of system performance rather than absolute performance scores which, in turn, would demand finding all the relevant documents. Therefore, the quality of a pool is traditionally measured on its ability to fairly rank systems, i.e. to fairly compare them. This is not limited to the systems which were actually pooled, but it should also hold for systems which were not pooled [59], to ensure the future reusability of a test collection also with new systems.
However, collections kept growing in size, and just judging deep pools over a diverse set of systems stopped being a practicable approach as well [53]. Therefore, much work has focused on developing alternative methods to better select which documents to pool and judge by performing some sort of focused sampling, aimed at picking documents that are more likely to be relevant, thus better employing the assessor budget or allowing for lower budgets at a comparable quality [29,35]. A method that actively decides which document to judge next is called an adjudication method. However, alternative prioritisation models may introduce biases or incompleteness in the judgements, hampering the future reusability of a test collection [49].
Therefore, the quality of new adjudication methods is traditionally assessed by checking that they rank systems as closely as possible to the full set of judgements of a (good quality) top-k pool, ensuring that they can still properly answer the question "is system A better than system B?". This is quantified by computing the correlation, e.g. Kendall's τ [25,26], between the ranking of systems produced by an adjudication method and by the full top-k pool.
The rationale is that if this correlation is high, one may assume the validity of the new method and aim to use it in the future for building new test collections at a comparable quality but with a lower assessment cost. However, the question researchers are really interested in is rather "is system A statistically significantly better than system B?", since this ensures that observed differences are not due just to the randomness present in the construction process of a collection and, especially, that the found differences would generalise better and still hold in operational settings [17,38]. The problem is that the above correlation measures ignore whether the statistical significance between the evaluated systems is preserved.
Let us better explain this problem with an example. Let us assume we have three different IR systems, Sys1, Sys2 and Sys3, and that their true ranking, given by the full top-k pool, is (Sys1, Sys2, Sys3). We perform a significance test between all possible pairwise comparisons and we obtain that Sys1 is significantly better than Sys2 and Sys3, and Sys2 is also significantly better than Sys3. Then, we create a new set of judgements using some adjudication method and repeat the above procedure. Using this new pool, we find the same ranking of systems as when using the full top-k pool, leading to a perfect correlation and concluding that the adjudication method is fully equivalent, but less costly, than the full top-k pool. However, we do not know anything about the significance between systems. If we repeat the same significance test using the new pool instead, we may not find any significant difference between any pair. We may thus conclude that there is no evidence of any system being different from the rest. This would be the opposite conclusion than the one drawn on the full top-k pool, where all the system pairs were significantly different.
In this work, our objectives are two-fold. First, we aim to propose a new approach to evaluate the validity of low-cost adjudication methods, focusing on how they preserve the statistically significant differences between systems. Second, we analyse some state-of-the-art adjudication methods using our new approach to gain new insights about them. In particular, we aim to answer the following research questions:
RQ1 Are the adjudication methods able to preserve the same statistically significant differences as the full top-k pool?
RQ2 When adjudication methods fail to see a real significant difference, do they follow any distinguishable pattern in terms of system position in the ranking?
RQ3 Are the adjudication methods able to preserve the same statistically significant differences as the full top-k pool for new (non-pooled) systems?
The rest of the paper is organised as follows: Section 2 introduces past work; Section 3 explains our methodology; Section 4 and Section 5 report our experiments; and, finally, Section 6 draws conclusions and presents some ideas for future work.

RELATED WORK
How to build high-quality experimental collections for retrieval evaluation is still an open research question [13,53,56]. Research in adjudication methods looks for ways of prioritising the pooled documents so that the assessors expend their effort in judging relevant documents. In this way, we may only need to judge some of the pooled documents while maintaining the quality of the judgements, thus making more efficient use of the resources.
Losada et al. [29] proposed a series of sampling methods based on the multi-armed bandit problem. The multi-armed bandit problem [46, Chapter 2] has been a subject of research for decades in Reinforcement Learning (RL), statistics and other fields. These methods bring ideas from RL to the task of document adjudication for building test collections. They apply Bayesian principles to this problem, formalising the uncertainty associated with reviewing a document from a pooled system. Other works have also explored the development of adjudication methods [11,27,32,35]. Section 4 provides further details about the state-of-the-art adjudication methods under experimentation. Adjudication methods have shown remarkable improvements in bringing relevant documents earlier in the pooling process, and indeed they were used to build the collection of the TREC Common Core Track of 2017 [1]. However, the quality of the judgements produced with a limited budget is still an open question [49].
Previous work on adjudication methods used a series of metrics to evaluate the quality of these algorithms. The commonest is Kendall's τ [25,26] correlation, which researchers use to measure how well a new adjudication method can induce the gold ranking of systems, i.e. the one on the full top-k pool. Another top-weighted correlation, τ_AP [58], is also common. This correlation penalises swaps in higher positions more. Some works [49,53] also measure the change in the ranking position of the system that suffers the highest drop as a measure of the reusability of an experimental collection. The problem with all these measures, as we already introduced earlier, is that they ignore the significance between the scores of the systems. If we ignore this, it is meaningless to account for ranking swaps.
In this work, we propose a new methodology to evaluate low-cost adjudication methods that, instead of focusing only on the ranking of the systems, focuses on evaluating how well a method preserves the real pairwise significant differences.
Statistical significance testing is of paramount importance in IR, and studying the properties of significance tests is an active area of research [4,7,8,9,10,15,16,23,33,34,39,43,44,47,48,57]. However, this is out-of-scope for the present work which, instead, focuses on considering the output of a statistical significance test as a way to assess the quality of an adjudication method.

METHOD
Let S = {s_1, ..., s_m}, |S| = m, be the set of systems under experimentation, and let Q_G be the gold assessments (also called gold qrels), i.e. the full top-k pool. Using an effectiveness measure of choice, we compute the per-topic scores for each of the m systems and we perform a statistical test for each pairwise comparison between systems. From this test, we obtain, for each pair of systems s_i and s_j (i < j ≤ m), a triplet ⟨s_i, s_j, r⟩, where r ∈ {>, ≫, <, ≪}, denoting the four outcomes we are interested in: s_i is better than s_j (s_i > s_j), s_i is significantly better than s_j (s_i ≫ s_j), s_j is better than s_i (s_i < s_j), or s_j is significantly better than s_i (s_i ≪ s_j). We use T_G to denote the set of triplets that result from the statistical test performed using the gold qrels. Similarly, we use Q_L to denote the qrels obtained with a low-cost adjudication method (Q_L ⊆ Q_G) and T_L to denote the set of triplets that result from the statistical test performed with them. Note that |T_G| = |T_L| = m(m−1)/2. Finally, we use S_G to denote the set of comparisons from T_G that are significantly different, that is, the set of triplets for which r ∈ {≪, ≫}, and S_L for the significantly different comparisons obtained with the low-cost assessments.
As we already explained, we are interested in studying to what extent the judgements produced by different low-cost adjudication methods preserve the statistically significant differences between systems we observe when using the gold qrels. The idea here is that if the low-cost method is able to preserve such differences, we could confidently use it to build new collections in the future at lower assessment cost. Thus, we compare how T_G and T_L agree with each other using the measures described in the following section.

Measures
Kendall's τ. Kendall's τ is the measure traditionally used to evaluate adjudication methods. It computes the correlation between the ranking of systems under the gold qrels setting and the one under the qrels produced with the different adjudication methods.
Given two rankings over the same set of items, Kendall's τ computes how many items are swapped as follows: τ = (C − D) / (n(n−1)/2), where C is the number of concordant pairs (pairs of systems ranked in the same relative order in both lists), D is the number of discordant pairs (swapped pairs of systems), and n(n−1)/2 is the total number of pairs, given that we have n items.
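The formula above can be sketched directly in Python; the function and variable names below are ours, chosen for illustration, and the rankings are given as lists of system identifiers, best first.

```python
from itertools import combinations

def kendall_tau(gold_rank, observed_rank):
    """Kendall's tau between two rankings of the same n systems:
    tau = (C - D) / (n * (n - 1) / 2)."""
    pos_gold = {s: i for i, s in enumerate(gold_rank)}
    pos_obs = {s: i for i, s in enumerate(observed_rank)}
    concordant = discordant = 0
    for a, b in combinations(gold_rank, 2):
        # A pair is concordant when both rankings order a and b the same way.
        if (pos_gold[a] - pos_gold[b]) * (pos_obs[a] - pos_obs[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(gold_rank)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For instance, with the three systems of the earlier example, swapping Sys2 and Sys3 leaves two concordant pairs and one discordant pair, yielding τ = (2 − 1)/3 = 1/3.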
Precision and Recall. We consider the Precision (P) and Recall (R) of the significantly different pairs detected by the low-cost adjudication methods, defined as follows: P = |S_G ∩ S_L| / |S_L| and R = |S_G ∩ S_L| / |S_G|, where |S_G ∩ S_L| is the number of significantly different pairs common to both the gold and adjudication qrels, i.e. the correct ones when assuming the gold qrels detect the "true" differences. Precision indicates how much "noise" is introduced by an adjudication method, meant as additional significant differences not detected by the gold qrels; Recall indicates how many of the total possible significant differences are not detected by an adjudication method.
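These definitions amount to set intersections over the significant triplets; a minimal sketch (with names of our own choosing, representing each triplet as a (s_i, s_j, direction) tuple) could be:

```python
def significance_precision_recall(sig_gold, sig_low):
    """Precision/Recall of significantly different pairs.

    sig_gold, sig_low: sets of (sys_i, sys_j, direction) triplets with
    direction in {">>", "<<"}, from the gold and low-cost qrels.
    A pair only counts as correct if the direction also matches.
    """
    common = sig_gold & sig_low
    precision = len(common) / len(sig_low) if sig_low else 1.0
    recall = len(common) / len(sig_gold) if sig_gold else 1.0
    return precision, recall
```

A low-cost pool that recovers two of three gold significant pairs but adds one spurious pair thus scores P = R = 2/3.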
Agreements. We consider an adaptation of a series of agreement measures that have been used in past work [14,15,31,48]. Note that, while Kendall's τ and Precision/Recall focus on the ranking of systems (the former) or on matching significantly different pairs (the latter) in isolation, the following agreement measures consider them jointly.
• Active Agreements (AA): the cases where both sets of qrels detect a significant difference in the same direction. That is, ⟨s_i, s_j, ≫⟩ ∈ T_G and ⟨s_i, s_j, ≫⟩ ∈ T_L, or ⟨s_i, s_j, ≪⟩ ∈ T_G and ⟨s_i, s_j, ≪⟩ ∈ T_L. This is the best possible case: the more, the better.
• Active Disagreements (AD): the set of opposite outputs between both methods. That is, ⟨s_i, s_j, ≫⟩ ∈ T_G and ⟨s_i, s_j, ≪⟩ ∈ T_L, or ⟨s_i, s_j, ≪⟩ ∈ T_G and ⟨s_i, s_j, ≫⟩ ∈ T_L. This is the worst possible case, since it means that the two methods reach completely opposite conclusions for a given pair. Thus, the fewer, the better.
• Mixed Agreements (MA): we have four possible cases: ➊ ⟨s_i, s_j, ≪⟩ ∈ T_G and ⟨s_i, s_j, <⟩ ∈ T_L, or ➋ ⟨s_i, s_j, ≫⟩ ∈ T_G and ⟨s_i, s_j, >⟩ ∈ T_L, or ➌ ⟨s_i, s_j, <⟩ ∈ T_G and ⟨s_i, s_j, ≪⟩ ∈ T_L, or ➍ ⟨s_i, s_j, >⟩ ∈ T_G and ⟨s_i, s_j, ≫⟩ ∈ T_L. We distinguish between MA_G (➊ and ➋), which counts the cases where the adjudication method was not able to see a gold significant difference, and MA_L (➌ and ➍), which counts the cases where a low-cost method sees a significant difference that is not in the gold qrels. Note that MA_G + MA_L = MA.
• Mixed Disagreements (MD): we also have four possible cases: ➎ ⟨s_i, s_j, ≪⟩ ∈ T_G and ⟨s_i, s_j, >⟩ ∈ T_L, or ➏ ⟨s_i, s_j, ≫⟩ ∈ T_G and ⟨s_i, s_j, <⟩ ∈ T_L, or ➐ ⟨s_i, s_j, >⟩ ∈ T_G and ⟨s_i, s_j, ≪⟩ ∈ T_L, or ➑ ⟨s_i, s_j, <⟩ ∈ T_G and ⟨s_i, s_j, ≫⟩ ∈ T_L. Here, as with MA, we also distinguish between MD_G (➎ and ➏) and MD_L (➐ and ➑).
Bias. Analogously to Ferro and Sanderson [15], we also consider the publication bias, i.e. the likelihood of a researcher publishing a significant result using an adjudication method when in fact a significance test on the gold qrels would have produced either no significance (MA, MD) or a significant result in the opposite direction (AD). We define it as Bias = (AD + MA_L + MD_L) / (AA + AD + MA_L + MD_L). A value of 0% means that every significance detected by an adjudication method leads to the same conclusions (and publication) as those of the gold qrels. Conversely, a value of 100% means that every significance detected by an adjudication method leads to opposite conclusions (and publication) to those of the gold qrels. Thus, the lower the bias, the better. Note that, differently from Ferro and Sanderson [15], we do not consider the whole MA and MD but just MA_L and MD_L, since we are interested only in the publication bias induced by the adjudication method. This metric tries to capture the situations where a researcher sees a significant outcome under the reduced pools when, in reality, it would lead to a different conclusion under the gold qrels.
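The categories above partition the pairwise comparisons; as an illustration, a single comparison could be classified as follows. The function names, the "PA" label for the residual passive-agreement case, and the bias formula as coded here are our own sketch of the definitions above, not the paper's implementation.

```python
def classify_pair(gold, low):
    """Classify one pairwise comparison into the agreement categories.

    gold, low: outcome under the gold and low-cost qrels, one of
    '>', '>>', '<', '<<' (doubled symbols denote significance).
    Returns 'AA', 'AD', 'MA_G', 'MA_L', 'MD_G', 'MD_L', or 'PA'
    ('PA' = no significance on either side, our label).
    """
    sig_g, sig_l = gold in (">>", "<<"), low in (">>", "<<")
    same_dir = gold[0] == low[0]  # compare '>' vs '<', ignoring significance
    if sig_g and sig_l:
        return "AA" if same_dir else "AD"
    if sig_g:  # gold significant, low-cost not
        return "MA_G" if same_dir else "MD_G"
    if sig_l:  # low-cost significant, gold not
        return "MA_L" if same_dir else "MD_L"
    return "PA"

def publication_bias(counts):
    """Bias = (AD + MA_L + MD_L) / (AA + AD + MA_L + MD_L)."""
    wrong = counts.get("AD", 0) + counts.get("MA_L", 0) + counts.get("MD_L", 0)
    total = counts.get("AA", 0) + wrong
    return wrong / total if total else 0.0
```

For example, a gold '≫' seen as a plain '>' by the reduced pool falls in MA_G (case ➋), while a gold '<' seen as '≫' falls in MD_L (case ➑).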

Family-Wise Error Rate (FWER)
Performing multiple comparisons, in our case between each pair of systems, leads to an increase of the Type I error, i.e. incorrectly rejecting the null hypothesis, and inflates the number of significant differences found [20,22,37].
The Type I error probability is equal to the significance level α and, as the number of comparisons increases, this probability also does. If we perform m different system comparisons, the probability of correctly accepting the null hypothesis for all of them is equal to (1 − α)^m. Thus, the probability of committing at least one Type I error is 1 − (1 − α)^m. This is the family-wise error rate (FWER). If we have, for example, α = 0.05 and m = 6 comparisons (4 systems, 4(4−1)/2 = 6), this probability would rise to 0.264, which is not acceptable. For this reason, when we perform multiple comparisons, we should employ a technique to adjust the p-values, so that the FWER stays below α. Obviously, this has the side-effect of reducing the power of the statistical test and increasing the number of Type II errors, i.e. not detecting an actual significant difference.
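The arithmetic in this paragraph is easy to check numerically; the helper names below are ours:

```python
def n_pairwise(n_systems):
    """Number of pairwise comparisons among n systems: n(n-1)/2."""
    return n_systems * (n_systems - 1) // 2

def fwer(alpha, m):
    """Family-wise error rate for m comparisons at significance level alpha:
    1 - (1 - alpha)^m."""
    return 1 - (1 - alpha) ** m
```

With 4 systems we get 6 comparisons, and fwer(0.05, 6) ≈ 0.265, matching the value reported above up to rounding.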
There are several options to control the FWER in a multiple comparison situation. The Bonferroni correction, for example, is a post-hoc correction where, if we have m different comparisons, we should use α/m as our significance level in each pairwise comparison. However, the Bonferroni correction is known to be too conservative and to reduce the power of a test too much, especially when the number of comparisons increases as in our case. Therefore, we employ the randomised version of the Tukey Honestly Significant Difference (HSD) test [8,37]. This is a non-parametric computer-based generalisation of the common permutation test for handling more than 2 systems. At each permutation, the test perturbs the array of system scores of each topic and, after this perturbation, computes the difference between the maximum and minimum average system scores. Then the test counts how many times the actual difference between the systems' average performances is greater than the permuted one to determine whether it is honestly significant [8]. The Tukey HSD test produces a p-value for each pairwise comparison, which can be compared to the significance level α to decide whether that pair of systems is significantly different or not. Algorithm 1 (adapted from prior work [8,37]) shows the details of our implementation.

EXPERIMENTAL SETUP
Collections. We employ the TREC-8 ad hoc collection, known to have a very high-quality pool [54,56]. It includes 129 system submissions, each retrieving 1000 documents per topic, and 50 topics. Official relevance judgements are based on a pool of depth 100 over 71 out of the 129 submitted runs, resulting in 86 830 assessments across all 50 topics. The average pool size per topic is 1736, while the maximum and the minimum are 2992 and 1046, respectively. Additionally, we use the collection from the document ranking task of the TREC 2021 Deep Learning track [12], which adopted a shallow pooling approach at depth 10, then enlarged with a method based on active learning. We used only the documents in the top-10 pools as our gold qrels to provide a fairer comparison to the case of TREC-8. It includes 66 runs, retrieving 100 documents for each topic, and 13 058 judgements made by NIST assessors over 57 different topics. The depth-10 pools we used include 6510 judgements, with an average pool size of 114, a maximum of 226 and a minimum of 50.

Algorithm 1: Paired Randomised Tukey HSD. Input: an n × m topic-system scores matrix and the number of permutations. Output: an m × m matrix holding a p-value for each pairwise system comparison.
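As a rough illustration of Algorithm 1, the paired randomised Tukey HSD described in Section 3 can be sketched as follows. Function and parameter names are ours, and this is a plain sketch of the procedure (permute scores within each topic, compare the permuted max-min spread of mean scores against each observed pairwise difference), not the paper's actual implementation.

```python
import numpy as np

def randomised_tukey_hsd(scores, n_perm=10000, seed=42):
    """Paired randomised Tukey HSD (sketch).

    scores: (n_topics, n_systems) matrix of per-topic effectiveness scores.
    Returns an (n_systems, n_systems) matrix of p-values.
    """
    rng = np.random.default_rng(seed)
    n_topics, n_systems = scores.shape
    means = scores.mean(axis=0)
    # Observed absolute difference of mean scores for every pair of systems.
    obs_diff = np.abs(means[:, None] - means[None, :])
    count = np.zeros((n_systems, n_systems))
    for _ in range(n_perm):
        # Independently permute the system scores within each topic.
        perm = np.array([rng.permutation(row) for row in scores])
        perm_means = perm.mean(axis=0)
        spread = perm_means.max() - perm_means.min()
        # A pair is counted when the permuted max-min spread reaches its
        # observed difference; this is what makes the test "honest".
        count += spread >= obs_diff
    return count / n_perm
```

Each p-value can then be compared against α = 0.05; note that, unlike per-pair permutation tests, the same permuted max-min spread is used for every pair, which is how the test controls the FWER.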
Adjudication methods.We consider a series of state-of-the-art adjudication methods.
• top-k pooling. We adapt the standard method used in TREC to limited-budget situations. When limiting the budget of assessments, we choose a depth k large enough to fill that budget. Then, pooled documents are sorted by their document identifier [55].
• MoveToFront (MTF).MTF is a dynamic adjudication method proposed by Cormack and colleagues [11] that has been acknowledged as a robust adjudication method [2].
• MaxMean (MM), MM Non Stationary (MM-NS), Thompson Sampling (TS) and TS Non Stationary (TS-NS). Bandit-based methods for document adjudication apply Bayesian principles to formalise the uncertainty associated with the probability of pulling a positive reward (a relevant document) from playing a bandit [28].
• Hedge. Hedge is an online learning algorithm adapted for pooling in [3]. A more detailed explanation of applying Hedge to pooling can be found in [29].
• NTCIR top-k prioritisation. Documents in the pool are sorted by the number of runs that return the document at or above depth k (the higher, the better); ties are broken by the sum of the ranks of that document within the runs (the lower, the better) [41].
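The NTCIR prioritisation in the last bullet amounts to a two-key sort; a minimal sketch, with function and variable names of our own choosing, could be:

```python
from collections import defaultdict

def ntcir_prioritise(runs, depth):
    """NTCIR-style top-k pool prioritisation (sketch).

    runs: list of ranked document-id lists, one per run.
    Pooled documents are sorted by (1) the number of runs returning
    them at or above rank `depth` (more is better) and (2) the sum of
    their ranks within those runs (lower is better).
    """
    n_runs = defaultdict(int)
    rank_sum = defaultdict(int)
    for run in runs:
        for rank, doc in enumerate(run[:depth], start=1):
            n_runs[doc] += 1
            rank_sum[doc] += rank
    return sorted(n_runs, key=lambda d: (-n_runs[d], rank_sum[d]))
```

A document retrieved by many runs near the top of their rankings is thus judged first, which matches the intuition that such documents are the most likely to be relevant.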
Other Settings. We used Average Precision (AP) [6] and Normalized Discounted Cumulative Gain (NDCG) [24] as performance measures to score runs. We used α = 0.05 as significance level and 1 000 000 permutations in the Tukey HSD test. Finally, since MTF, MM, MM-NS, TS, and TS-NS have a stochastic nature, the reported results for those methods are averaged over 50 executions of each.
To ease the reproducibility of the experiments, we release the source code.1

RESULTS AND DISCUSSION
RQ1: Preservation of significant differences
In Table 1, we report the Kendall's τ, Precision and Recall, as defined in Section 3, that each adjudication method achieves, while varying the number of assessments per topic. We report the scores for 100 judgements per topic (which is 6% of the budget of the original pool) and 300 (17%). All these values were obtained using the pooled systems of the TREC-8 collection, which includes 71 different systems.
Regarding Kendall's τ, and consistently with previous findings in the literature, we see that almost every method achieves a very high correlation (τ > 0.90) already at 6% of the original budget. While this means that every method obtains a ranking of systems very similar to the one of the gold qrels, it also makes it very difficult to distinguish among methods. Moreover, we can observe that the top-k and NTCIR methods lag behind the rest, leaving room for improvement in developing more efficient adjudication strategies for building new collections in evaluation workshops.
As we mentioned earlier, Kendall's τ does not allow us to know whether the compared algorithms preserve the same statistically significant differences as the gold qrels. Therefore, we study to what extent they do by using the Precision and Recall measures previously introduced.
We observe that every method obtains Precision and Recall values over 90% in almost all the cases, which is a quite solid result. Moreover, every method is able to mostly preserve the same differences with just 6% of the original budget. With 300 assessments per topic (17% of the budget), Recall is (almost) 1.00 for most of the methods, indicating that they are able to detect all the significant differences of the gold qrels at less than one third of the cost.
It is also interesting to observe that most of them detect some differences that were not detected in the gold qrels. Indeed, Precision is lower than 1.00 while Recall is almost 1.00 (all the differences in the gold qrels are detected). In other terms, S_L (the set of significant differences detected by the adjudication method) is not a subset of S_G (the set of significant differences detected by the gold qrels). A possible explanation might be that, since reduced pools lack some relevant documents, the performance difference of some pairs of systems (delta AP/NDCG between the two systems in our case) turns out to be larger than under the gold qrels, making the pair significantly different on the reduced pool but not on the gold qrels. Since more evaluation of this issue would require further experimentation, due to space restrictions we leave this investigation for future work.
To support a more detailed analysis, in Table 2, we report the raw agreements of each method. The upper half of the table includes the results obtained when using AP for evaluating the runs. In this case, there are a total of 966 gold significant differences (|S_G| = 966). The lower half includes the results when using NDCG. In this case, there are a total of 917 gold significant differences (|S_G| = 917).
The AA counts confirm that adjudication methods are more effective than the top-k and NTCIR pooling methods in detecting significant pairs in the correct order, especially at lower budgets. They provide further insights about the (almost) 1.00 Recall (see Table 1) we observed for most adjudication methods. Indeed, with AP, the gold qrels detect 966 significantly different pairs and the AA counts are (almost) 966, indicating that the 1.00 Recall is due to significant pairs detected in the correct order. The same happens for NDCG, where we observe that most methods obtain AA values near 917. In other terms, the slight drop in Kendall's τ observed in Table 1 is not caused by wrongly ordered pairs, even when Recall is 1.00. When it comes to the specific methods, MTF achieves the best AA figures for budgets of 100 and 300 when using AP, while under NDCG Hedge works slightly better at lower budgets and bandit-based methods perform best with a budget of 300.
If we compare the AA counts with the number of relevant documents found by a method (the # rels. row), we observe a somewhat unexpected behaviour. One might think that the more relevant documents found, the more AA increases. However, for a budget of 100 judgements per topic, Hedge adjudicated 2170 relevant documents, 485 more than MTF, but the latter achieves the highest AA with AP; the same happens again for a budget of 300: MTF is not the best in terms of relevant documents but it is the best in terms of AA. We can observe something similar with NDCG: finding more relevant documents does not necessarily mean more AA. Obviously, having more relevant documents in the pool helps in increasing the number of AA, but these results showcase that it is not the only factor. Overall, these observations suggest that not all the relevant documents are equally discriminative in finding significantly different pairs. Indeed, relevant documents appear at different ranks in the results lists, and the same (or even a higher) number of relevant documents may contribute differently to the performance score of a run and, in turn, to the significant differences found. So far, research has mostly focused on determining the number of topics needed [5,40,43,51,52] or on identifying the most discriminative subset of topics [19,21,30,36]. These findings open up the possibility of future research on which are the best relevant documents to more reliably discriminate among systems, an area not well explored yet, to the best of our knowledge.
Almost in every case, no method incurs a mixed or active disagreement, i.e. detecting significant differences when there is a swap. This represents a very important insight from this experiment, since it shows that no method causes a ranking swap between a pair of systems that were originally significantly different. In other terms, the drop in Kendall's τ is not due to swaps between systems that are significantly different on the gold qrels; swaps only happen among not significantly different systems, having a much lower impact.
Let us now consider MA_G and MA_L. The former accounts for significant pairs in the gold qrels which are missed by reduced pools; thus, it helps mainly to explain drops in Recall. The latter accounts for significant pairs in a reduced pool which are not present in the gold qrels; thus, it helps mainly to explain drops in Precision. We can observe that MA_G decreases to almost 0 as the budget size increases, with the exception of the top-k pooling, Hedge and NTCIR methods, consistently with the previous findings in Table 1. Moreover, MA_L is consistently higher than MA_G, explaining the loss in Precision even at very high Recall levels.
When it comes to publication bias, we observe moderate values, 7% and below, suggesting that none of the methods would lead to conclusions severely different from the gold qrels. We can observe that bias quickly decreases as the budget increases and that adjudication methods are more effective than top-k pooling, achieving a bias up to 2-3 times lower.
Finally, we can observe no diverging trends between the two evaluation metrics employed, AP and NDCG. This shows that the results presented here are not an artefact of the metric used, but reflect the adjudication methods being evaluated.
Additionally, we run experiments on the TREC Deep Learning (DL) track 2021. We selected this collection as it has opposing characteristics to TREC-8. The DL collection adopts a very shallow pooling at just depth 10, representing a quite challenging setting for adjudication methods. We believe that using these two collections helps in supporting the generalizability of the results presented here. Table 3 reports the Kendall's τ, Precision, and Recall, similarly to Table 1 for TREC-8; Table 4 reports the agreement counts, similarly to Table 2 for TREC-8. In general, we observe considerably lower and much more varied performance on DL 2021 than on TREC-8.
Kendall's τ is generally low for all the methods with both metrics. In TREC-8, adjudication methods were able to obtain very strong results with only 17% of the original budget, while in this case no method is able to reach that performance even with 26%. One important difference is that, while in TREC-8 the top-k and NTCIR methods were clearly underperforming with respect to the other methods, in DL 2021 Hedge clearly achieves the worst performance.
When it comes to the agreements (Table 4), a notable difference is that, at low budgets (9%), MD appear, while they go to (almost) zero for higher budgets. The MD at 9% budget indicate that the drops in Kendall's τ are also due to swaps in the significantly different pairs. The problem concerns MD_L, i.e. swaps in significant pairs detected by a reduced pool but not by the gold qrels, more than MD_G, i.e. swaps in significant pairs detected by the gold qrels but not by a reduced pool. As a consequence, part of the loss of Precision is due to swaps in the significant pairs, a more severe condition than the one causing the loss of Precision in TREC-8. This issue impacts top-k and NTCIR more than the adjudication methods but, overall, low budgets and shallow pools do not lead to reliable enough results. When it comes to AA, differently from TREC-8, the methods struggle to get close to the total number of significantly different pairs on the gold qrels. As in the TREC-8 case, an increase in the number of relevant documents found does not necessarily lead to an increase in the AA counts.

Table 4: Relevant documents, agreements and bias of each adjudication method for a varying number of judgements per topic. Parentheses indicate the size with respect to the full pool. We used the 66 pooled systems from DL21. The top-10 pool includes 3541 relevant documents. There are a total of 2145 pairwise comparisons, of which 418 are significant under the gold qrels with MAP (upper half), and 417 with NDCG (lower half). For each budget, the best values are bolded and the worst ones are underlined.
On the positive side, AD is always 0 for DL 2021 as well.
When it comes to MA, we observe two different patterns. Differently from TREC-8, MA_G is always quite high, motivating the general lack of Recall. In addition, MA_L does not substantially decrease as the budget increases, explaining the general lack of Precision.
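Our reading of the agreement/disagreement taxonomy used above can be sketched as follows; the function names, the exact tie handling, and the label for pairs significant under neither qrels set are assumptions on our part, reconstructed from the definitions in the text rather than taken from the paper's code:

```python
def classify_pair(delta_gold, sig_gold, delta_reduced, sig_reduced):
    """Classify one system pair from its score difference and significance
    under each qrels set (delta_* share a sign convention, e.g. score of
    system A minus score of system B under the respective qrels).

    AA: significant under both qrels, same ordering (active agreement)
    AD: significant under both qrels, swapped ordering (active disagreement)
    MA_G / MD_G: significant only under the gold qrels, ordering kept / swapped
    MA_L / MD_L: significant only under the reduced qrels, ordering kept / swapped
    PA: significant under neither (our label for the remaining pairs)
    """
    swap = delta_gold * delta_reduced < 0
    if sig_gold and sig_reduced:
        return "AD" if swap else "AA"
    if sig_gold:
        return "MD_G" if swap else "MA_G"
    if sig_reduced:
        return "MD_L" if swap else "MA_L"
    return "PA"

def agreement_counts(pairs):
    """pairs: iterable of (delta_gold, sig_gold, delta_reduced, sig_reduced)."""
    counts = {}
    for p in pairs:
        label = classify_pair(*p)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Under this reading, Precision and Recall over significant pairs would follow directly from the counts: AA over the pairs significant under the reduced qrels for Precision, and AA over the pairs significant under the gold qrels for Recall. We present this as a plausible formalization consistent with Section 3, not a verbatim reproduction of it.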
Publication bias is exceedingly high, especially at low budgets, ranging between 25% and 50%. Overall, these high values shed a negative light on the reliability of the conclusions one would draw when using these methods under shallow-pool conditions.
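One plausible formalization of publication bias, consistent with the definition recalled in the conclusions (the chance of publishing a conclusion that would not hold, or would be the opposite, under the full pool), can be sketched as follows; the exact definition used in the paper may differ, and the function name is ours:

```python
def publication_bias(sig_reduced, holds_in_gold):
    """sig_reduced[i]: pair i is significant under the reduced qrels
    (i.e. a 'publishable' finding); holds_in_gold[i]: the same conclusion,
    with the same winner, is also significant under the gold qrels.
    Returns the fraction of publishable findings the full pool would not
    confirm."""
    published = [h for s, h in zip(sig_reduced, holds_in_gold) if s]
    if not published:
        return 0.0
    return sum(1 for h in published if not h) / len(published)
```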

RQ2: How and where the methods fail
We study how and where, in terms of rank positions, the different methods fail in detecting significant differences.
We focus our analysis on the cases of mixed agreements (MA), which have been shown to be the main factor for the loss of Precision and Recall. Figure 1 shows the distribution of the score differences in system pairs that belong to MA, with respect to their position in the gold ranking of systems, for a budget of 100 assessments (6%). For each MA pair, we compute the difference between the score of the best and the worst system in the pair (under the adjudicated qrels, not the gold ones), recording it with a positive sign for the best system and a negative one for the worst system. Figure 1 conveys information about the distribution of such differences as a series of boxplots would, but in a more compact and readable way. The x-axis is the position of each system in the ranking of systems under the gold qrels, and we consider bins of three rank positions to make the figure more readable. For example, the first point in the figure represents the distribution of the mentioned differences for the first three systems in the gold ranking. The solid line represents the median of the bin; the shaded area is limited by the first and third quartiles of the distribution, i.e. it represents the inter-quartile range; finally, the dashed lines are the maximum and the minimum. A break in the lines means that no pair of systems in that range of rank positions is an MA.
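The binned summary underlying this kind of figure can be sketched as follows; the bin size matches the three-position bins described above, while the function name and the use of `numpy` percentiles for the quartiles are illustrative assumptions:

```python
import numpy as np

def binned_summary(rank_positions, diffs, bin_size=3):
    """For each bin of `bin_size` consecutive gold-ranking positions,
    summarise the MA score differences falling in it: median, quartiles,
    min and max. A bin with no mixed agreement yields None, i.e. a break
    in the plotted lines."""
    out = []
    for start in range(0, max(rank_positions) + 1, bin_size):
        vals = [d for r, d in zip(rank_positions, diffs)
                if start <= r < start + bin_size]
        if not vals:
            out.append(None)
            continue
        q1, med, q3 = np.percentile(vals, [25, 50, 75])
        out.append({"median": med, "q1": q1, "q3": q3,
                    "min": min(vals), "max": max(vals)})
    return out
```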
We can see some clear trends among all the evaluated methods. For most adjudication methods, the biggest differences occur between MA systems in the middle of the ranking (we see wider areas in the middle of the ranking), whereas we see narrower distributions for the top-ranked and lowest-ranked systems. This suggests that the MA, and the consequent loss of Precision, happen in a region of moderate impact, since mid-rank systems may receive less interest in any case. The top-k and NTCIR methods represent two notable exceptions. Indeed, top-k concentrates most of the score differences in the top ranks; therefore, top-k is not only the worst-performing method (see Table 1 and Table 2) but it also fails in the most impactful region of the ranking. This is even worse for NTCIR, where the biggest differences (of 0.2 points) are all clustered in the top positions of the ranking.

RQ3: Evaluation of unseen systems
We investigate the reusability of the judgements produced by a low-cost method, i.e. their ability to fairly evaluate unseen systems. Usually, reusability is evaluated by following a leave-one-group-out approach. This consists in forming pools leaving out one participating group at a time and using those pools to evaluate the submissions of the group that was left out. We follow a different approach, using the non-pooled systems of TREC-8. To this aim, we performed the same experiments as in the previous sections, but using the non-pooled systems of TREC-8. In this way, we are evaluating systems that did not participate in the construction of the pools. As commented in Section 4, this collection has been repeatedly acknowledged in the community as a high-quality one to evaluate unseen systems. Thus, we assume that the TREC-8 gold judgements are reusable and, if a low-cost method provides the same significant differences as them, we conclude that it is reusable as well.
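The classic leave-one-group-out protocol mentioned above can be sketched as follows; `build_pool` and `evaluate` are hypothetical placeholders for the collection-specific pooling and scoring procedures, not functions from any existing toolkit:

```python
def leave_one_group_out(groups, build_pool, evaluate):
    """Reusability check: for each participating group, build the pool from
    every other group's runs only, then evaluate the held-out group's runs
    against that pool.

    groups: dict mapping group name -> list of that group's runs.
    Returns dict mapping group name -> list of scores for its runs."""
    results = {}
    for held_out, runs in groups.items():
        other_runs = [r for g, rs in groups.items()
                      if g != held_out for r in rs]
        pool = build_pool(other_runs)
        results[held_out] = [evaluate(run, pool) for run in runs]
    return results
```

A toy instantiation, where a run is a set of retrieved document ids, the pool is their union, and the score is the overlap with the pool, shows the mechanics: each group is scored against judgements it never contributed to.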
Table 5 reports the Kendall's τ, Precision and Recall values of every method, for a varying number of assessments per topic, using the non-pooled systems. On the positive side, Table 5 shows similar trends as Table 1, suggesting that there is not a specific bias against non-pooled systems. On a slightly negative side, we observe that performance in Table 5 is generally slightly lower than in Table 1, especially at the lowest budget, indicating a bit more loss and some more swaps due to not being pooled. In more detail, TS, MM and Hedge always have the highest correlation scores, while MM always achieves the best Recall, independently of the budget and the metric. This means that if we were to gather the judgements of a new collection, MM would be the best option in terms of reusability of the collected assessments. As before, the top-k and NTCIR methods lag behind the other methods in all cases and for every considered measure. This finding suggests that alternative methods might be a better option to gather assessments when constructing new experimental collections.
Table 6 reports the agreements for the non-pooled systems, similarly to Table 2 for the pooled ones. The results follow the same trends as with the pooled systems, further supporting the lack of strong biases against non-pooled systems. These scores confirm that alternative adjudication methods are more effective than top-k, which, contrary to what we observed in Table 2, is now clearly the worst method. As before, finding more relevant documents does not necessarily mean more AA; therefore, not all the relevant documents are equally discriminative for non-pooled systems either.
No method incurs a mixed or active disagreement when evaluating the non-pooled systems. This further supports the fact that most drops in Kendall's τ are due to swaps between systems that are not significantly different under the gold qrels.
When it comes to publication bias, we observe similar trends as in the case of the pooled systems, albeit with lower values, indicating that published conclusions would not change in the case of non-pooled systems either.
Finally, we can observe similar trends between the results obtained with AP and those obtained with NDCG, supporting the fact that the results presented here are generalizable in terms of the evaluation of unseen systems, and that they are not an artefact of the evaluation metric used.

CONCLUSIONS AND FUTURE WORK
We argued for the need for a more powerful way of evaluating adjudication methods. In particular, while the current approach focuses only on how closely two alternative methods rank systems, quantified by Kendall's τ, we think that we should also focus our attention on how different methods behave with respect to the significantly different pairs of systems detected. Indeed, while the current approach looks for stability in answering the question "is system A better than B?", our proposed method looks for stability in answering the question "is system A significantly better than B?", which is the ultimate question researchers are interested in to ensure the generalizability of results.
To this end, we considered two measures, namely Precision and Recall, which consider significantly different pairs in isolation, as well as measures, the agreement/disagreement counts, which relate them to swaps in the ranking of systems. We also considered the problem of publication bias, i.e. the chance of publishing results/conclusions that would not hold, or would be the opposite, when using the full pool instead of a reduced one.
To both validate and showcase our proposed approach, we conducted a thorough experimentation on TREC-8, a collection renowned for its high-quality deep pool, and TREC Deep Learning 2021, a collection adopting a very shallow pool. In this way, we have shown that our methodology allows us to obtain insights that are not possible using Kendall's τ alone.
For example, we found that no active disagreements (AD) and (almost) no mixed disagreements (MD) happen. This means that the observed drops in Kendall's τ are mostly due to swaps between systems that are not significantly different. Therefore, those drops concern not very interesting system pairs, and it might not be worth striving for (or judging a method solely by) a Kendall's τ of 1.00.
We also found that the number of relevant documents detected by a method does not necessarily increase the number of significantly different pairs detected, suggesting that not all the relevant documents in a pool are equally discriminative. This opens up interesting future investigations on which (relevant) documents would be optimal for a pool, while the current focus has been more on determining how many and which topics to sample.
We have shown that drops in Precision and Recall are caused by mixed agreements (MA), which distribute unevenly across rank positions and, therefore, have quite different impacts: those happening at mid-to-bottom rank positions are less serious than those happening at the top positions of the ranking.
Finally, we also found that no adjudication method induces strong biases against non-pooled systems, thus further supporting the use of these methods to construct new test collections for IR evaluation. Previous work evaluated the reusability of bandit-based methods using Kendall's τ and other swap-based measures, and concluded that the collections built with them were less reusable than desirable. With the new evaluation approach we have presented in this paper, we shed some more light on this issue and show that, when focusing on significance between systems, bandit-based methods are indeed reusable.
Overall, our approach allowed us to show that existing methods for human assessment adjudication in IR evaluation can preserve most of the true statistically significant differences between the pairwise comparisons of systems. Besides this, as discussed in detail, our approach allowed us to pinpoint which adjudication method works better in specific conditions, why, and how it differs from other methods. This will thus be a helpful tool and guide for researchers when they have to decide which method to choose in their settings.

Figure 1:
Distribution of MAP differences between systems in MA for a budget of 100 assessments (6%). The x-axis represents the systems sorted by their position in the official ranking. Each data point holds the distribution of 3 systems. The solid line represents the median of the bin. The shaded area is limited by the first and third quartiles of the distribution, i.e. it represents the inter-quartile range. Finally, the dashed lines are the maximum and the minimum. Breaks in the lines mean that there was no mixed agreement for those systems. We used the 71 pooled systems of TREC-8.

Table 2:
Relevants, agreements and bias of each adjudication method for a varying number of judgements per topic. Parentheses indicate the size with respect to the full pool. We used the 71 pooled systems of TREC-8. The top-100 full pool includes 4728 relevant documents. There are 2485 pairwise comparisons, of which 966 are significant under the gold qrels with MAP (upper half), and 917 with NDCG (lower half). For each budget and metric, the best values are bolded and the worst ones are underlined.

Table 3:
Kendall's τ, Precision and Recall (see Section 3) of each adjudication method for a varying number of judgements per topic. 10 and 30 are the budgets of judgements per topic. Parentheses indicate the size with respect to the full pool. We used the 66 pooled systems from DL21. For each column, the best values are bolded and the worst ones are underlined.

Table 5:
Kendall's τ, Precision and Recall (see Section 3) of each adjudication method for a varying number of judgements per topic. 100 and 300 are the budgets of judgements per topic. Parentheses indicate the size with respect to the full pool. We used the 58 non-pooled systems from TREC-8. For each budget, the best values are bolded and the worst ones are underlined.


Table 6:
Relevants, agreements and bias of each adjudication method for a varying number of judgements per topic. Parentheses indicate the size with respect to the full pool. We used the 58 non-pooled systems from TREC-8. The top-100 full pool includes 4728 relevant documents. There are 1653 pairwise comparisons, of which 509 are significant under the gold qrels with MAP (upper half), and 527 with NDCG (lower half). For each budget, the best values are bolded and the worst ones are underlined.