Coherence-based Query Performance Measures for Dense Retrieval

Query Performance Prediction (QPP) estimates the effectiveness of a search engine's results in response to a query without relevance judgments. Traditionally, post-retrieval predictors have focused on either the distribution of the retrieval scores or the coherence of the top-ranked documents, using traditional bag-of-words index representations. More recently, BERT-based models using dense embedded document representations have been used to create new predictors, but mostly applied to predict the performance of rankings created by BM25. Instead, we aim to predict the effectiveness of rankings created by single-representation dense retrieval models (ANCE & TCT-ColBERT). Therefore, we propose a number of variants of existing unsupervised coherence-based predictors that employ neural embedding representations. In our experiments on the TREC Deep Learning Track datasets, we demonstrate improved accuracy for dense retrieval (up to 92% over sparse variants for TCT-ColBERT and 188% for ANCE). Going deeper, we select the most representative and best performing predictors to study the importance of differences among predictors and query types on query performance. Using the scaled Absolute Rank Error (sARE) evaluation measure and a particular type of linear mixed model, we find that query types significantly influence query performance (and are responsible for up to 35% of the unstable performance of QPP predictors), and that this sensitivity is unique to dense retrieval models. In particular, we find that in the cases where our predictors perform worse than score-based predictors, this is partially due to the sensitivity of MAP@100 to query types. Our novel analysis provides new insights into dense QPP that can explain the potentially unstable performance of existing predictors, and outlines the unique characteristics of different query types on dense retrieval models.

Retrieval effectiveness in search engines can vary across different queries [22,53]. Being able to accurately predict the likely effectiveness of a search engine for a given query may facilitate interventions, such as asking the user to reformulate the query [5,29,40,54]. To this end, the task of Query Performance Prediction (QPP) aims to predict the effectiveness of a search result in response to a query without having access to relevance judgments [7]. In the last two decades, a number of query performance predictors have been proposed, which can be grouped into two main categories: pre-retrieval predictors estimate query performance using only linguistic or statistical information contained in the queries or the corpus [24,25,34,46,59]; on the other hand, post-retrieval predictors use the relevance scores or contents of the top returned documents, by measuring, for example, the focus of the result list compared to the corpus [11,60], or the distribution of the scores of the top-ranked documents [12,39,43,47,51]. Predictors based on NQC [49] (the standard deviation of relevance scores) have been found to be surprisingly accurate. A further group of predictors examines the pairwise similarities among the retrieved documents [1,16]. Thus far, these predictors have been applied using traditional bag-of-words representations. While examining the coherence between returned documents is useful, as we show, these representations are not suitable for predicting the query performance of more advanced retrieval methods.
More recently, pre-trained language models (PLMs) have introduced neural network architectures that encode the embeddings of queries and documents [15,27,28,56], and have led to increased retrieval effectiveness. Often, a BERT-based model is trained for use as a reranker of the results retrieved by (e.g.) BM25 [41]; such cross-encoders include BERT_CLS [36] and monoT5 [37]. On the other hand, dense retrieval approaches [26,56] are increasingly popular, whereby embedding-based representations of documents are indexed, and those with the most similar embeddings to the query are identified through nearest-neighbour search (e.g., ANCE [56], TCT-ColBERT [28]). Compared to reranking setups, dense retrieval is attractive as recall is not limited by the initial BM25 retrieval approach, and improvements in the PLM can improve all aspects of retrieval effectiveness. Therefore, dense retrieval models inspire us to develop predictors that are effective for predicting their rankings.
In parallel, neural architectures have also been adopted as methods for predicting query difficulty. These post-retrieval methods are supervised, and use refined neural architectures in order to produce a final performance estimate [2,14,23,58]. For instance, BERT-QPP [2] fine-tunes a BERT [15] model for QPP by estimating the relevance of the top-ranked document retrieved for each query. However, it is outperformed by unsupervised predictors when using advanced retrieval methods and the TREC Deep Learning datasets [18]. In our view, the problem lies in the mismatch of representations between predictor and ranking, which is best illustrated in Figure 1. At the top, we see the pipeline resulting from a BM25 ranking and, at the bottom, a ranking from a dense retrieval system [26,56]. While BERT-based QPP techniques can be used to predict the effectiveness of BM25 [2,14,23,58], single-representation dense retrieval models already contain representations that can accurately predict their corresponding ranking, thus eliminating the need to apply Step 3 (e.g., BERT-QPP). Instead, to create predictors applicable to dense retrieval, we can use the existing embedded representations (Step 2). Indeed, by considering patterns among the embeddings of the retrieved documents, we can update existing unsupervised predictors from traditional sparse [1,16] to dense representation-based.
At the same time, the selection of evaluation measure can have an impact on the conclusions of QPP experimental results. This observation is more prominent if we consider, for example, that unsupervised QPP predictors such as NQC [47] were primarily optimised for MAP at deeper cutoffs (100 or 1000); on the other hand, more recent supervised predictors were either optimised for RR@10 [2,23], or used both NDCG@10 and RR@10 [14], providing comparable results between the two measures, but in both cases, results for MAP were missing. As a result, it is impossible to provide insights that are fully generalisable, as omitting either measure can lead to biased results and incomplete conclusions. We believe that the design of experimental studies should be aligned with the idea that the different measures are not interchangeable, and that proposed predictors should be complemented with the cases where the predictor fails, together with an explanation of the reasons why this happens.
One explanation for why a QPP predictor fails could be that query performance is further mediated by query categorisation. Few works have examined how QPP varies with query categories [8,19]. Indeed, knowing which queries are more difficult to answer may inform us about how to develop more refined predictors. Recently, a query taxonomy was proposed [6], where the identified question categories were placed in a labelled dataset, together with a classifier that enables researchers to apply this categorisation to other datasets. We adopt this taxonomy to also quantify the extent to which query categories are responsible for the unstable performance of QPPs across different evaluation measures. In that work, questions belonging to the Debate, Experience, and Reason categories were found more difficult than others.
In short, our contributions are the following: (i) We propose a number of embedding variants of existing coherence predictors, as well as our own extension, pairRatio, an unsupervised predictor that uses pairwise relations of embedding vectors. In this way, we create predictors designed for dense retrieval; (ii) We apply existing predictors to two state-of-the-art single-representation dense retrieval models, namely ANCE [56] and TCT-ColBERT [28], as well as BM25, using all three evaluation metrics currently used for QPP, and show that changing the representations increases performance significantly, not just for dense but also for sparse retrieval; (iii) By also comparing with supervised predictors, we show that applying a BERT-based model for dense QPP is an unnecessary step in the pipeline that decreases QPP performance; (iv) We apply multilevel statistical models [13,21,32,50] to QPP to quantify the relationship between query categorisation and unstable QPP performance. In our analyses, we measure the performance of different QPPs in relation to the total QPP variation that can be attributed to the categorisation, or, as we term it, query types. At the same time, we detect a unique sensitivity of dense retrieval methods, which are affected by query type (up to a 35% increase in query performance variations due to query categorisation) and exhibit larger differences between predictors, a pattern which is not apparent in sparse retrieval.
In addition, we observe: (a) Our proposed predictors provide the highest correlations for the more precision-oriented NDCG@10 for all retrieval models, while NDCG@10 and MRR@10 provide similar results. (b) Our multilevel perspective proposes a solution to correlation instabilities between measures, by showing how the interplay with query types differently influences each of the measures. In other words, we provide an analytical viewpoint that can explain any predictor, and show that our proposed predictors mainly optimise the measure that is less influenced by query variations. We share our methodology and results at: https://github.com/mariavlachou/Coherence-DR-QPP. The structure of the rest of this paper is as follows: we present related work in Section 2, and present our new extended predictors in Section 3. Then, we follow with a traditional correlation analysis of QPP predictors in Sections 4 and 5, continue with an extended linear mixed model analysis to test for query type in Section 6, and conclude with some final remarks in Section 7.
In terms of post-retrieval QPP, earlier post-retrieval predictors examined the focus of the result list induced by language models (probability distributions of all single terms) [11]. For example, Clarity [11] measures the divergence between the language model of the top-ranked documents and that of the corpus: the higher the divergence, the better the performance. The Utility Estimation Framework (UEF) [48] uses pseudo-effective reference lists induced by term probability-based language models and estimates their relevance using predictors such as NQC (see Section 2.1 below). Both of these rely upon term probabilities, and are, therefore, not feasible for extending our predictions to dense retrieval. Query Feedback (QF) [60] refers to the overlap of the returned documents with those obtained after applying pseudo-relevance feedback; yet, pseudo-relevance feedback approaches for dense retrieval are still in their infancy [55,57], so we do not consider QF further.
In the remainder of this section, we discuss the main types of query performance predictors that could be applied to dense retrieval, specifically score-based unsupervised predictors (Section 2.1) and document representation-based predictors (Section 2.2).

Score-based QPP
Score-based predictors encode certain assumptions about how the scores should be distributed for high- or low-performing queries. For instance, a simple predictor is the Maximum Score among the retrieved documents [42]: the higher the maximum score, the more confident the retrieval system is that it has found a document that matches the query well. The most commonly applied score-based predictor is Normalised Query Commitment (NQC) [47], based on the standard deviation of the retrieval scores, which is negatively correlated with the amount of query drift (the non-related information in the result list) [33]. Several variations of NQC have been proposed that further enhance its accuracy [12,39], incorporate the scores' magnitude [51], or estimate a more robust version of the variance with bootstrapping. Indeed, the Robust Standard Deviation estimator (RSD) [43] extends NQC to multiple contexts (each with a bootstrap sample) representing a population of scores [43]. Score-based predictors (Step 1 in Figure 1) are easily applicable to dense retrieval, since scores are computed by each retrieval method.
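To make the score-based predictors concrete, the following is a minimal sketch, assuming `scores` is the array of retrieval scores of the top-ranked documents in descending rank order. The `corpus_score` normalisation term of NQC is left as a caller-supplied input, since how it is estimated (e.g., treating the whole corpus as one pseudo-document) is collection- and system-dependent; this parameterisation is our assumption, not the paper's exact implementation.

```python
import numpy as np

def max_score(scores: np.ndarray) -> float:
    """Maximum Score predictor [42]: the highest retrieval score
    among the top-ranked documents."""
    return float(np.max(scores))

def nqc(scores: np.ndarray, corpus_score: float, k: int = 100) -> float:
    """Normalised Query Commitment (NQC) [47]: the standard deviation
    of the top-k retrieval scores, normalised by the query's corpus
    score (supplied by the caller)."""
    top = scores[:k]
    return float(np.std(top) / corpus_score)
```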

Unsupervised Coherence Predictors.
In general, effective unsupervised predictors that consider document representations are preferable, since they require less computation than supervised predictors. One example of an unsupervised predictor that examines the lexical representations of documents is spatial autocorrelation [16], which considers the spatial proximity of lexical document representations, using their pairwise TF.IDF-based similarities to produce a new set of scores "diffused in space". The final predictor is obtained by correlating the original scores with the diffused scores. Indeed, a low correlation between the scores of topically-close documents is assumed to imply poor retrieval performance.
Another family of recent coherence-based predictors creates a graph of the most similar documents among the top-ranked documents [1], based on their TF.IDF representations. Specifically, metrics such as Weighted Average Neighbour Degree (WAND) and Weighted Density (WD) were found to enhance the performance of score-based predictors after linear interpolation. These predictors (applied in Step 2 in Figure 1, top) were proposed for sparse document representations and have not previously been applied to dense embedded representations.

Supervised & Neural Predictors.
In general, supervised models for QPP can be attractive due to the varying sources of indicators for inferring query performance [42]. At the same time, they are computationally complex compared to unsupervised predictors. For example, Neural-QPP [58] is a multi-component supervised predictor that combines the output of existing unsupervised QPP predictors using weak supervision; we can think of this as a neural supervised aggregation predictor. More recently, BERT-QPP [2] fine-tunes a BERT model for the QPP task by adding cross-encoder or bi-encoder layers that estimate an effectiveness measure (e.g., NDCG) based on the contents of the top returned document in response to the query. While BERT-QPP can also be applied to dense retrieval rankings, it uses a different model to that used by the dense retrieval approach itself. Of the two BERT-QPP variants, the bi-encoder version is closer to the intuition of single-representation dense retrieval. Finally, qppBERT-PL [14] adds an LSTM network on top of the BERT representation to model both document contents and the progression of estimated relevance in the ranking. Compared to BERT-QPP, this approach has promise, as it considers more information than just the top-ranked document.
To summarise, existing predictors have either focused on sparse document representations or retrieval scores on the unsupervised side, or have introduced neural pre-trained architectures to create more complex supervised predictors. However, no work has addressed unsupervised predictors using dense embedded representations, as are readily available in dense retrieval configurations. Instead, we argue that by using simple predictors that consider the document representations resulting from dense models (Step 2 of Figure 1, bottom), we can accurately predict effectiveness without the need for supervised cross-encoder-based methods (Step 3). In the next section, we detail existing predictors that can be adapted to dense retrieval.

COHERENCE PREDICTORS FOR DENSE RETRIEVAL
In this section, we first describe some existing sparse coherence-based predictors in Section 3.1, and then show how these can be adapted to be better suited to dense retrieval settings in Section 3.2.

Sparse Coherence-based Methods
3.1.1 Spatial Autocorrelation (AC) [16]. First, consider $d_i$ to be a document's TF.IDF vector. Then, the inner product of two documents at ranks $i$ and $j$ is given by $\mathrm{sim}(d_i, d_j)$. We can obtain a pairwise similarity matrix among the $k$ top-ranked documents as follows:

$$S_{ij} = \mathrm{sim}(d_i, d_j), \quad 1 \le i, j \le k, \qquad (1)$$

where $k$ is the cutoff number of the top-$k$ documents. Let $\vec{y}$ denote the vector of original retrieval scores, with $y_i = s(d_i)$. Projecting (multiplying) each element of the matrix $S$ on $\vec{y}$, we can obtain a new vector of diffused scores as:

$$\tilde{y} = S \, \vec{y}. \qquad (2)$$

Thereafter, an estimate of the spatial autocorrelation (AC) [16] is obtained by using the Pearson correlation between the two vectors:

$$\mathrm{AC} = \rho(\vec{y}, \tilde{y}), \qquad (3)$$

which quantifies the relation between the initial and diffused scores. Indeed, as mentioned above, a low correlation between the original retrieval scores $\vec{y}$ and those weighted by their topical similarity (the diffused scores $\tilde{y}$) was found to imply poor retrieval performance [16].
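As a minimal sketch of Equations (1)-(3), assuming `doc_vectors` holds the vectors of the top-k documents row-wise (TF.IDF here, but the same computation applies unchanged to the embedding variant of Section 3.2):

```python
import numpy as np
from scipy.stats import pearsonr

def spatial_autocorrelation(doc_vectors: np.ndarray,
                            scores: np.ndarray) -> float:
    """Spatial autocorrelation (AC) [16] over the top-k documents.
    doc_vectors: (k, d) matrix of document vectors;
    scores: (k,) vector of the original retrieval scores."""
    S = doc_vectors @ doc_vectors.T      # pairwise similarity matrix (Eq. 1)
    diffused = S @ scores                # diffused scores (Eq. 2)
    return pearsonr(scores, diffused)[0] # AC (Eq. 3)
```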
3.1.2 Network Metrics. As mentioned above, the matrix $S$ represents all pairwise similarities between the top-retrieved documents. This matrix is equivalent to a fully connected network $G = (V, E)$, where each node $v_i \in V$ corresponds to document $d_i$'s TF.IDF vector, and each edge $e_{ij} \in E$ is weighted by the corresponding entry $S_{ij}$ [1]. In this regard, to avoid all edges being considered equal without attention to the edge weight, the network is further pruned via thresholding [9], whereby only the similarities higher than the mean similarity value are retained as neighbours.
Consequently, we have the following definitions, which correspond to some recently proposed network metrics [1] for QPP:

$$\mathrm{WAND} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{|N_i|} \sum_{d_j \in N_i} S_{ij}, \qquad (4)$$

where $N_i$ is the neighbourhood of document $d_i$. Typically, Equation (4) is applied on the pruned graph that only contains edges between the most similar documents, and hence corresponds to the more accurate Weighted Average Neighbour Degree (WAND) measure [1].
Another way to think about coherence is to count the observed edges, or similarities, over the set of all possible edges. This results in the Weighted Density (WD) measure, as follows:

$$\mathrm{WD} = \frac{\sum_{e_{ij} \in E} S_{ij}}{k(k-1)/2}. \qquad (5)$$

In short, a higher neighbourhood degree and a higher density of a graph network indicate a more coherent set of top-retrieved results. The general intuition behind these measures is that the presence of coherence, as reflected by highly similar documents in a top-retrieved set, indicates the ability of the retrieval method to distinguish relevant from non-relevant documents, and, therefore, to return the relevant ones at the top of the list.
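The exact weighting used in [1] may differ; the following sketch implements one reading of Equations (4) & (5), with neighbourhoods obtained by mean-similarity thresholding as described above. The per-node averaging and the handling of the symmetric matrix are our assumptions.

```python
import numpy as np

def wand_wd(S: np.ndarray):
    """WAND and WD over a symmetric pairwise similarity matrix S of
    the top-k documents. Edges above the mean off-diagonal similarity
    are kept as neighbours (pruning by thresholding [9])."""
    k = S.shape[0]
    off = ~np.eye(k, dtype=bool)          # mask out self-similarities
    kept = (S > S[off].mean()) & off      # pruned adjacency matrix
    # WAND (Eq. 4): mean similarity to retained neighbours, per node,
    # averaged over all nodes that have at least one neighbour.
    per_node = [S[i, kept[i]].mean() for i in range(k) if kept[i].any()]
    wand = float(np.mean(per_node)) if per_node else 0.0
    # WD (Eq. 5): retained edge weights over possible edges. Each
    # undirected edge is counted twice in the symmetric matrix, so
    # dividing by k(k-1) equals dividing the edge-weight sum by k(k-1)/2.
    wd = float(S[kept].sum() / (k * (k - 1)))
    return wand, wd
```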

Dense Coherence-based Methods
We now derive the embedding representation variants of the above predictors, in order to make them suitable for the prediction of neural dense retrievers. We first create the variants for embedding-based AC and network metrics, and then introduce a new variant that extends AC by considering rank groupings.

AC-embs.

Given that the embedding of a document $d_i$ can be written $\phi(d_i)$, we can define the pairwise similarities of the top $k$ ranked documents as:

$$S_{ij} = \mathrm{sim}(\phi(d_i), \phi(d_j)), \quad 1 \le i, j \le k. \qquad (6)$$

We can then apply autocorrelation (denoted as AC above) as per Equations (2) & (3). We denote this variant as AC-embs.

Network-embs.
Similarly, since we showed that the similarity matrix is equivalent to a fully connected network, we can apply WAND and WD to the embedding-based similarity matrix as per Equations (4) & (5), denoting these variants WAND-embs and WD-embs, respectively.

pairRatio.
We now introduce an extension of AC-embs inspired by visually exploring embedding relations. Specifically, in Figure 2, we visualise the pairwise similarity matrix $S$ obtained using TCT-ColBERT [28] embeddings for the top-100 passages, for one high- and one low-performing query in the TREC Deep Learning Track 2019 queryset. For the best performing query, there is higher pairwise similarity among documents at top ranks (top-left corner, indicated by a group of lighter shading), and lower similarity at lower ranks (darker shading). On the other hand, for the worst query, elements of darker shading appear at high ranks, indicating that the top-ranked documents may not be as coherent. In addition, there is less dark shading at low ranks compared to the best query. These observations inspire us to explore the trend of average top vs. bottom rank pairwise similarities of top-ranked embeddings.
Specifically, let $S_{r_1..r_2}$ denote the (diagonal) subset of $S$ between ranks $r_1$ and $r_2$ (the part of the similarity matrix determined by the two rank limits). Then, for given rank thresholds, we can measure the ratio between the mean pairwise similarity above and below those ranks, i.e., $S_{0..c_u}$ and $S_{c_l..k}$, as follows:

$$\mathrm{pairRatio} = \frac{\mu(S_{0..c_u})}{\mu(S_{c_l..k})}, \qquad (7)$$

where $\mu(\cdot)$ denotes the mean of the entries of the given matrix, $c_u$ corresponds to the end of the upper matrix, and $c_l$ symbolises the start of the lower matrix (we use the two cutoff points as separate hyperparameters). We call this predictor pairRatio. Unlike WAND and WD, we consider the magnitude of this contrast between top and bottom ranks as indicative of query performance. We believe that this contrast reflects the confidence of a more advanced retrieval method in its top-ranked results, and is, therefore, indicative of query performance.
Still, the similarity matrix $S$ can only provide information about the relative similarity of documents. Introducing some information about the document scores should increase prediction accuracy, since these relate to the absolute ranking of each document. Let $A$ be an adjusted matrix, where each entry $S_{ij}$, for a document pair $d_i$ and $d_j$, is multiplied by the final similarity of the query to each of the two documents:

$$A_{ij} = S_{ij} \cdot s(q, d_i) \cdot s(q, d_j), \qquad (8)$$

which better encodes the similarity of the query among the pairwise document similarities. pairRatio (Equation (7)) can then be applied upon $A$, which we denote as adjusted pairRatio, or A-pairRatio.
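As a minimal sketch of Equations (7) & (8), assuming `S` is the (k, k) pairwise similarity matrix of Equation (6) and `query_sims` the vector of query-document scores $s(q, d_i)$; whether the diagonal self-similarities are included in the means is our assumption:

```python
import numpy as np

def pair_ratio(S: np.ndarray, c_u: int, c_l: int) -> float:
    """pairRatio (Eq. 7): mean pairwise similarity among ranks
    [0, c_u) over the mean among ranks [c_l, k)."""
    upper = S[:c_u, :c_u]   # upper (top-rank) diagonal block
    lower = S[c_l:, c_l:]   # lower (bottom-rank) diagonal block
    return float(upper.mean() / lower.mean())

def a_pair_ratio(S: np.ndarray, query_sims: np.ndarray,
                 c_u: int, c_l: int) -> float:
    """A-pairRatio: each entry S_ij is first weighted by the query's
    similarity to both documents (Eq. 8), then pairRatio is applied."""
    A = S * np.outer(query_sims, query_sims)
    return pair_ratio(A, c_u, c_l)
```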
In short, we are interested in the effectiveness of these predictors based on dense document representations and how they perform in relation to their sparse versions. We test their performance compared to score-based and supervised predictors in Section 5.

EXPERIMENTAL SETUP
Our experiments address the following research questions: RQ1 How do unsupervised coherence-based predictors compare to unsupervised score-based predictors in dense and sparse retrieval?
RQ2 How do unsupervised predictors perform compared to supervised predictors in dense and sparse retrieval? To address these research questions, our setup is as follows: Datasets: We use the MSMARCO passage ranking corpus, and apply the TREC Deep Learning track 2019 and 2020 query sets, containing respectively 43 and 54 queries with relevance judgements. In particular, each query in these querysets has many judgements, obtained by pooling various distinct retrieval systems.
QPP Predictors: As unsupervised score-based predictors, we apply Max score (MAX) [42] and NQC [47]. As a representative variant of NQC, we choose RSD. This bootstrap-based predictor is the most recent NQC variant and was shown to outperform other score-based predictors. Specifically, we use the RSD(uni) version, which samples documents uniformly. For each cutoff, we sample from 0.60 to 0.80 of the initial result list size. We use spatial autocorrelation (AC) [16], WAND and WD [1], and the interpolation of WAND and WD with NQC (following the findings of the original paper [1], which suggest that the network metrics further increase the performance of NQC). We also report our embedding variants (AC-embs, WAND-embs, WD-embs, pairRatio, A-pairRatio). For each unsupervised predictor, we tune the hyperparameters on one dataset and apply them on the other. Specifically, to tune the cutoff value for the top-$k$ documents for all unsupervised predictors, including ours, we use a grid of values [5, 10, 20, 50, 100, 200, 500, 1000] (see the sketch below). For pairRatio and A-pairRatio, we also vary the upper and lower rank threshold hyperparameters $c_u$ and $c_l$.
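The cross-dataset tuning can be sketched as follows, assuming a `predictor(q, k)` callable and a per-query effectiveness map `eval_per_query`; these names are placeholders for the actual QPP function and per-query evaluation values, and the selection criterion (maximising Kendall's tau on the held-out query set) follows the evaluation protocol of Section 5:

```python
from scipy.stats import kendalltau

CUTOFFS = (5, 10, 20, 50, 100, 200, 500, 1000)

def tune_cutoff(predictor, tuning_queries, eval_per_query,
                cutoffs=CUTOFFS):
    """Pick the cutoff k maximising Kendall's tau between predictor
    outputs and per-query effectiveness on the tuning query set
    (the other TREC DL year)."""
    best_k, best_tau = None, -1.0
    for k in cutoffs:
        preds = [predictor(q, k) for q in tuning_queries]
        truth = [eval_per_query[q] for q in tuning_queries]
        tau, _ = kendalltau(preds, truth)
        if tau > best_tau:
            best_k, best_tau = k, tau
    return best_k, best_tau
```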
For supervised predictors, we report the bi-encoder and cross-encoder variants of BERT-QPP [2]. To achieve this, we retrained the BERT-QPP cross-encoder and bi-encoder models specifically for each of the dense retrieval models. These supervised predictors exhibit their highest correlations mainly for MRR, which is consistent with the fact that they train models that estimate the relevance of the top document of a ranking. In this regard, we check whether an alternative supervised predictor (which we call top-1(monoT5)), which passes only the top-retrieved document to a monoT5 model [37] (i.e., a model trained for relevance estimation and ranking rather than performance prediction), can perform well in dense retrieval. Note that we use the term QPP Predictors instead of baselines, since our purpose is not to demonstrate the superiority of a single predictor, but to show how a group of predictors behaves under different contexts and retrieval models.

CORRELATION RESULTS
Tables 1 and 2 show the accuracy of all our examined predictors on the TREC DL 2019 and 2020 query sets, respectively. Within each table, groups of columns denote the various retrieval approaches; the uppermost row reports the mean effectiveness of each ranking approach for each evaluation measure; the next group of rows contains the Kendall's $\tau$ correlations of the score-based predictors, the next one the unsupervised lexical coherence-based predictors; then we report the results for the embedding-based predictors, and finally for the supervised predictors [2]. In all cases, * denotes that the reported $\tau$ value is a significant correlation at $p < 0.05$.

RQ1: Score-based vs Coherence-based Predictors
As expected, for BM25, the distribution-based score predictors (NQC and RSD(uni)) show high accuracy for MAP@100 and NDCG@10, while their accuracy is lower for MRR@10, especially for DL 2019. However, unlike on older datasets, the accuracy of the sparse coherence predictors is very low on the TREC DL datasets. As for the dense coherence predictors, surprisingly, the AC-embs variant is the best performing predictor for MAP@100, and for NDCG@10 on 2020. As for our pairRatio variants, they are less effective than other unsupervised predictors such as NQC and AC-embs (except on MRR@10), and than the supervised predictors on MRR@10.
Next, we consider the two dense retrieval settings, i.e., ANCE & TCT-ColBERT. For TCT-ColBERT, we observe that our pairRatio predictors outperform not only supervised predictors, but also NQC (the best performing unsupervised predictor) for NDCG@10 and MRR@10 on both datasets, are only behind RSD(uni) for MRR@10 on the DL 2019 dataset, and are competitive for MAP@100. Another observation is that A-pairRatio improves accuracy over pairRatio, particularly for the TCT-ColBERT model, which indicates the need for including document-query relations. In summary, for NDCG@10 and MRR@10, for TREC DL 2020, in all four cases our dense coherence-based predictors (any of them considered) outperform score-based predictors; for TREC DL 2019, in two of the four cases ours are higher, in one case RSD is higher, and in one case they are identical. For ANCE, WAND-embs and WD-embs are better than score-based predictors for NDCG@10 and MRR@10 on the 2020 dataset, while they are only slightly behind them on the 2019 dataset. Overall, for MAP@100, NQC or RSD(uni) consistently outperform coherence-based predictors, while for NDCG@10 and MRR@10, the picture is more unstable; however, in most cases, coherence-based predictors win for dense retrieval. Further, as might be expected, changing the type of representations from sparse to dense increases the performance of coherence-based predictors across the dense retrieval settings (for ANCE, in 7 out of 9 (QPP, Measure) cases in TREC 2019, and 9 out of 9 for TREC 2020; for TCT-ColBERT, our pairRatio variants are more effective), as the updated representations match those of the retrieval methods. To answer RQ1, for dense retrieval, score-based predictors perform well for MAP@100, while coherence-based predictors show increased accuracy for NDCG@10 and MRR@10. For sparse retrieval, dense coherence predictors are in general better than score-based ones.

Table 1: Kendall's $\tau$ correlations of unsupervised and supervised predictors for TREC DL 2019. The highest correlation by an unsupervised predictor in each column is emphasised in bold and (*) indicates significance for single predictors at $p < 0.05$.

RQ2: Unsupervised vs. Supervised Predictors
Next, we compare the performance of unsupervised with supervised QPP predictors for each retrieval method. For BM25, we are able to reproduce the results of the bi-encoder and cross-encoder variants of BERT-QPP, as reflected by the higher values on MRR and the competitive correlations on the other two metrics. For BM25, we used the authors' checkpoints, while we re-trained the method for ANCE & TCT-ColBERT. However, their values are still lower than NQC (a simple score-based unsupervised predictor), RSD(uni) (NDCG@10 on the TREC 2019 queryset), our pairRatio (MRR@10 on 2019), AC-embs (MAP@100 on 2019, MAP@100 and NDCG@10 on 2020), and top-1(monoT5) (MRR@10 on both datasets). Most importantly, for the two dense retrieval methods, supervised predictors are not as effective as unsupervised predictors, such as Max and NQC. For TCT-ColBERT, supervised predictors are less effective than our pairRatio variants for NDCG@10 and MRR@10, and than NQC and RSD(uni) for all metrics. The strongest observed correlations of the BERT-QPP variants in dense retrieval are for MAP@100. However, they have a cost to deploy (applying a BERT model on the top-ranked result); we argue that this resource would be better spent re-ranking the top results. In addition, the simpler "supervised" variant, top-1(monoT5), which uses the monoT5 score of the top-ranked document, is a more accurate predictor than BERT-QPP across all retrieval methods, particularly for MRR@10, the metric on which BERT-QPP is most competitive. This surprising result shows that BERT-QPP is itself just a relevance estimator for the top-ranked document that has been trained to predict MRR@10; any effective relevance estimator can do as good a job, if not better. To answer RQ2, we find that the existing BERT-QPP supervised predictors are less accurate than unsupervised predictors (existing and ours) for dense retrieval.

MODELING QUERY DIFFERENCES IN QPP
The performance of dense coherence-based predictors is particularly accurate in certain dense retrieval settings (for TCT-ColBERT: pairRatio and A-pairRatio; for ANCE: WAND-embs and WD-embs) and shows superior performance to score-based predictors, especially for NDCG@10. Still, score-based predictors are often better, particularly for MAP@100. This difference in QPP correlations among evaluation metrics motivates us to explore whether the relationship between QPPs and retrieval effectiveness is mediated by the type of query (for instance, queries of an Experience type have been found difficult to answer [6]). For this purpose, we apply a distribution-based QPP evaluation approach based on the scaled Absolute Rank Error (sARE) [20]. Specifically, the sARE value of each query is calculated as:

$$\mathrm{sARE}(q) = \frac{|r_p(q) - r_e(q)|}{|Q|}, \qquad (9)$$

where $r_p(q)$ and $r_e(q)$ are the ranks assigned to query $q$ by the QPP predictor and by the evaluation metric, respectively (one sARE value is obtained per query, instead of a single point estimate), and $Q$ is the set of queries. This further allows using sARE in statistical models [18,20].
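As a minimal sketch of Equation (9), assuming `pred_values` and `eval_values` hold, for each query, the predictor output and the per-query evaluation value; averaging tied ranks (the `rankdata` default) is our assumption:

```python
import numpy as np
from scipy.stats import rankdata

def sare(pred_values, eval_values) -> np.ndarray:
    """Per-query scaled Absolute Rank Error (sARE) [20], Eq. (9):
    |rank by predictor - rank by evaluation metric| / |Q|."""
    r_p = rankdata(pred_values)   # ranks induced by the QPP predictor
    r_e = rankdata(eval_values)   # ranks induced by the evaluation metric
    return np.abs(r_p - r_e) / len(pred_values)
```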
Unlike [18,20], who use ANOVA, we use Linear Mixed Effects (LMM) models [13,21,32,50], which also belong to the family of Generalised Linear Models (GLM) [31,35], but split the total explained variance in sARE into two levels. Specifically, Level 1 specifies the within-query variations (how each query changes, i.e., the per-query variance over different QPP predictors). Level 2 specifies the between-query differences; it further explains each part of Level 1 by showing how it changes according to a between-query factor; here we use the type of query, or query type, obtained by applying the classifier proposed in [6] on the TREC DL query sets. A 2-level approach is necessary to model the interplay of QPPs with query types: while each query receives a separate sARE value for each QPP predictor, each measurement fluctuates based on the type of query (multiple queries of the same type share a similar query performance), and measurements are, therefore, nested within their group (each query belongs to only one level of query type). Thus, the multilevel approach allows splitting the total variation in sARE into within-query (due to QPPs, Level 1) and between-query (due to query types, Level 2) variation. Using separate models for each evaluation measure allows us to check which measure is more affected by query types. Next, we describe the LMMs in detail.

The full model, $M_{QT}$, is specified at the two levels as:

$$\text{Level 1:} \quad \mathrm{sARE}_{ij} = \pi_{0i} + \pi_{1i}\,\mathrm{QPP}_{ij} + \varepsilon_{ij}$$
$$\text{Level 2:} \quad \pi_{0i} = \gamma_{00} + \gamma_{01}\,\mathrm{QT}_{i} + \zeta_{0i}, \qquad \pi_{1i} = \gamma_{10} + \gamma_{11}\,\mathrm{QT}_{i} + \zeta_{1i}$$

where $\mathrm{sARE}_{ij}$ is the sARE of query $i$ at QPP predictor measurement $j$, $\pi_{0i}$ is the intercept (initial status) of query $i$'s change trajectory (reference QPP predictor, i.e., the first QPP measurement), $\pi_{1i}$ is the slope (rate of change) in sARE (per predictor unit), and $\varepsilon_{ij}$, $\zeta_{0i}$, $\zeta_{1i}$ are the Level 1 and Level 2 residuals. Here, $\gamma_{00}$ and $\gamma_{10}$ are the average true sARE for the reference query type in the initial status and rate of change, respectively. Similarly, $\gamma_{01}$ and $\gamma_{11}$ show the effect of the between-query factor (query type, QT) on sARE, for the initial status and rate of change. For convenience, we use $\mathrm{sARE}_{ij}$ in an equivalent compact form (combining Levels 1 and 2) as:

$$\mathrm{sARE}_{ij} = \gamma_{00} + \gamma_{10}\,\mathrm{QPP}_{ij} + \gamma_{01}\,\mathrm{QT}_{i} + \gamma_{11}\,\mathrm{QT}_{i}\,\mathrm{QPP}_{ij} + \zeta_{0i} + \zeta_{1i}\,\mathrm{QPP}_{ij} + \varepsilon_{ij} \qquad (10)$$

Table 3 shows the interpretation of each of the model parameters. Next, we introduce two reduced models. We start with $M_E$, which only assumes an average sARE value:

$$\mathrm{sARE}_{ij} = \gamma_{00} + \zeta_{0i} + \varepsilon_{ij} \qquad (11)$$

Finally, we obtain $M_{QPP}$ as follows:

$$\mathrm{sARE}_{ij} = \gamma_{00} + \gamma_{10}\,\mathrm{QPP}_{ij} + \zeta_{0i} + \zeta_{1i}\,\mathrm{QPP}_{ij} + \varepsilon_{ij} \qquad (12)$$

In what follows, we use a model selection strategy, as indicated in Table 4, where each row shows the models being compared, the quantity of interest, and its definition. The difference between $M_E$ and $M_{QPP}$ is the effect of the QPP predictor; $\Delta R^2_{\varepsilon}$ tells us how much of the total variability within queries can be attributed to QPPs. Similarly, when comparing $\sigma^2_{\zeta_0}$ and $\sigma^2_{\zeta_1}$ of $M_{QT}$ with those of $M_{QPP}$, these two models differ in the inclusion of the terms $\gamma_{01}(\mathrm{QT}_i)$ and $\gamma_{11}(\mathrm{QT}_i)$; $\Delta R^2_{\zeta_0}$ and $\Delta R^2_{\zeta_1}$ tell us how much of the total variability between queries in initial status and rate of change, respectively, is due to query type. Starting from $M_E$, we sequentially move to $M_{QPP}$ and $M_{QT}$, if needed. At each step, we compare the model that contains the added factor with the one that does not. The decision is made based on the significance of the fixed effects and the model deviance [32,50], indicating goodness-of-fit (the lower, the better). The deviance in this case is $D = -2LL$, where $LL$ is the maximised log-likelihood of each model. We implement the proposed LMMs using the lme4 R package [4,52], with Full Maximum Likelihood Estimation; a sketch of this model-selection sequence is given after the research questions below. We now address the following research questions: RQ3 Is the accuracy of query performance prediction influenced by query type more for dense retrieval than sparse retrieval?
RQ4 How sensitive are the different evaluation measures to (a) query types and (b) QPPs?
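The following is a minimal sketch of the model-selection sequence $M_E \rightarrow M_{QPP} \rightarrow M_{QT}$ under the stated assumptions. Our analysis uses the lme4 R package; here statsmodels' MixedLM serves only as a Python stand-in, and the column names (`sare`, `qpp`, `qtype`, `query`) are illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_lmms(df: pd.DataFrame):
    """df: one row per (query, predictor), with columns sare (per-query
    sARE), qpp (predictor id), qtype (query type), and query (query id).
    Models are fitted with full ML (reml=False) so that deviances
    D = -2 LL are comparable across nested models."""
    # M_E (Eq. 11): average sARE only, with a random intercept per query.
    m_e = smf.mixedlm("sare ~ 1", df,
                      groups=df["query"]).fit(reml=False)
    # M_QPP (Eq. 12): adds a fixed QPP effect and a random QPP slope.
    m_qpp = smf.mixedlm("sare ~ C(qpp)", df, groups=df["query"],
                        re_formula="~C(qpp)").fit(reml=False)
    # M_QT (Eq. 10): adds query type and its interaction with QPP.
    m_qt = smf.mixedlm("sare ~ C(qpp) * C(qtype)", df, groups=df["query"],
                       re_formula="~C(qpp)").fit(reml=False)
    deviances = {name: -2 * m.llf
                 for name, m in [("M_E", m_e), ("M_QPP", m_qpp),
                                 ("M_QT", m_qt)]}
    return m_e, m_qpp, m_qt, deviances
```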

RQ3 -Importance of Query Type
Table 5 provides the resulting LMMs from our model comparison strategy, as outlined in Section 6.1. For the dense retrieval models, the selected equations ($M_{QT}$) contain a coefficient that indicates sensitivity to a particular type of query (the first line for ANCE refers to Not-A-Question queries, and the first two lines for TCT-ColBERT refer to Experience and Reason queries). The corresponding BM25 LMMs do not contain a query type coefficient.
Most importantly, in Table 6, the top half shows the proportions of gained explained variance for both levels (with ✗ indicating no significant gains), while the bottom half highlights the included effect terms. The first row shows that variations due to QPPs are similar for the three retrieval methods (similar $\Delta R^2_{\varepsilon}$ values). However, the next two rows show much higher relative gains in explained variance for the two dense models than for BM25, especially for $\Delta R^2_{\zeta_1}$, reaching 35% and 23% for ANCE and TCT-ColBERT, respectively. Indeed, as $\Delta R^2_{\zeta_1}$ includes query type, this means that a noticeable proportion of the variance is attributed to query type. Therefore, for dense retrieval, some query types are more accurately predicted by certain QPPs, while other query types work better for other QPPs. This indicates that QPP performance cannot be judged in isolation from query taxonomies, which in some cases are more influential than the predictor itself. To answer RQ3, the accuracy of query performance prediction is influenced by query type more for dense retrieval than for sparse retrieval. Figure 3 plots the per-query sARE values for TCT-ColBERT, for both MAP@100 (a) and NDCG@10 (b). In each plot, the sARE values (y-axis) are plotted as a function of QPP predictor (x-axis), with each query type as a separate panel, and colours indicating different QPP predictors (from the left: starting with dense coherence-based predictors, then supervised, and score-based on the right). For MAP@100, the trends for two query types, Experience and Reason, behave differently from the rest; these two types show better performance (lower sARE) for coherence-based than for score-based predictors, while the opposite holds for Instruction and Not-A-Question queries. As for Evidence-based and Factoid queries, there is higher variance in sARE among different queries, but for dense coherence-based predictors, the variance is smaller than for score-based predictors, as indicated by the corresponding colours. In general, for MAP@100, performance seems to be affected by the different types of queries, which makes QPPs more unstable. Indeed, Experience and Reason were originally found to be harder questions for retrieval systems [6]. This result reflects the selected model for MAP@100, which was $M_{QT}$ (an effect of query type across QPP measurements).

RQ4 -Sensitivity of Evaluation Measures
On the other hand, for NDCG@10, QPP performance for different query types seems more uniform. The trend still looks different for Experience and Not-A-Question queries compared to the rest, but those represent only a small portion of the total queries. For the remaining types, the structure is similar, with some variations in strength. Importantly, for Evidence-based, Factoid, Instruction, and Reason queries, there is increasing variance across queries for score-based compared to dense coherence-based predictors. This indicates that our proposed predictors are less sensitive to query type than score-based and supervised predictors. Note that while we plot the full model, for NDCG@10, $M_{QPP}$ was preferred, i.e., only an effect of QPP predictor. This is complemented by Table 5, where the NDCG@10 LMMs contain a coefficient for QPPs, but not for query types or their interaction with QPPs.
To summarise, in Section 5.1, we observed that score-based predictors showed improved performance for MAP@100, but our LMM analysis showed that this result is susceptible to influential query types. Instead, our dense coherence-based predictors showed higher correlations mainly for NDCG@10, and with the LMM analysis (the lack of query type terms and of $\Delta R^2$ gains at Level 2), we showed that this result is more stable across different query types. Therefore, our predictors provide promising evidence for generalisability compared to existing predictors. In other words, while both MAP@100 and NDCG@10 are sensitive to QPPs, NDCG@10 is less sensitive to query type variations than MAP@100, thereby answering RQ4.

CONCLUSIONS
We examined the accuracy of QPP upon two single-representation dense retrieval methods. In particular, we proposed new variants of unsupervised coherence-based predictors and increased their accuracy for dense retrieval. In this way, we showed that changing the representations from TF.IDF to the neural embeddings provided by the dense retrieval models, together with some further modifications, is enough to generalise the performance of unsupervised predictors in relation to supervised ones. Indeed, with the increasing effectiveness brought by dense retrieval methods, our proposed predictors become more competitive, especially for NDCG@10 and MRR@10. Also, we highlighted that focusing on a single evaluation measure to optimise a proposed predictor may falsely inform future studies, since MAP@100 and NDCG@10 cannot be used interchangeably. At the same time, we demonstrated the interplay between the different QPP predictors, evaluation metrics, and the particular types of queries, showing that query type is an important aspect when studying QPP. Importantly, we showed that while score-based predictors still remain very competitive for MAP@100, our examined statistical models indicate that MAP@100 is highly influenced by the type of query. Instead, using NDCG@10, QPP performance is more stable across queries, and since our proposed predictors show higher performance on this metric, this is a promising result for more generalisable performance in dense retrieval.

Figure 3: Per-query sARE for TCT-ColBERT as a function of QPP predictor, for MAP@100 (a) and NDCG@10 (b), with each query type as a separate panel.

Table 2: Results on TREC DL 2020. Notation as per Table 1.

Table 3: Explanation of the terms included in $M_{QT}$.

Table 6: Proportion of explained variance per component and included fixed effects in each LMM for all three retrieval methods. ✓ indicates the presence of a fixed effect in the LMMs, while ✗ shows the absence of either an important contribution of a factor (top) or a fixed effect (bottom).