Revisiting Bag of Words Document Representations for Efficient Ranking with Transformers

Modern transformer-based information retrieval models achieve state-of-the-art performance across various benchmarks. The self-attention of transformer models is a powerful mechanism to contextualize terms over the whole input, but it quickly becomes prohibitively expensive for the long inputs required in document retrieval. Instead of focusing on the model itself to improve efficiency, this paper explores different bag of words document representations that encode full documents by only a fraction of their characteristic terms, allowing us to control and reduce the input length. We experiment with various models for document retrieval on MS MARCO data, as well as zero-shot document retrieval on Robust04, and show large gains in efficiency while retaining reasonable effectiveness. At inference time, our approach lowers both time and memory complexity in a controllable way, allowing memory footprint and query latency to be traded off further. More generally, this line of research connects traditional IR models with neural “NLP” models and offers novel and elegant ways to explore the space between (efficient, but less effective) traditional rankers and (effective, but less efficient) neural rankers.


INTRODUCTION
Traditional information retrieval approaches operate on a bag of words, representing documents by their word distribution. In contrast, modern neural retrieval approaches rely on large pre-trained language models that represent long sequences of text in natural language, resulting in superior performance yet at a dramatic increase in computational cost. The quadratic complexity of self-attention with the input size makes time and GPU memory two critical factors for applying large transformers to long sequences of textual input. This is particularly problematic in document ranking, where the size of the input typically exceeds the 512-token limit of the most popular transformer-based models such as BERT [12] and its successors. The input limitation of modern transformer-based rankers has led to research being mostly focused on ranking short inputs, with the TREC Deep Learning Track's passage ranking task [4, 5] gaining popularity. The challenge of moving from passage to document ranking calls for new ways to encode very long documents in an efficient way. There is active research addressing this by reducing the complexity of the self-attention to allow for longer inputs [49, 50], also in particular for ranking [20, 22]. A naive way to avoid long inputs to the transformer model is to truncate long documents to the first n tokens [6, 55, 59], which comes with the caveat of ignoring the subsequent content entirely and is ineffective for aggressive truncation.
This paper explores a radically different approach by revisiting bag-of-words document representations and focusing on the model's input instead of changing the model. What if we can reduce the document representations to a shorter input length by only including the most important terms? In this case, the main bottleneck, the quadratic complexity over the input length, works in our favor, as this will immediately and dramatically reduce the computational cost. The main aim of this paper is to investigate the efficiency and effectiveness of reducing the full narrative text of documents to bag-of-words representations of only the most characteristic terms. Table 1 shows an example document where only the 64 most characteristic terms are visible and all other input tokens are grayed out. This illustrates two important aspects. First, the input is reduced to only a very small fraction of its original input tokens. This makes even extreme reductions of the input length worth considering, reductions that would not be viable with truncation. Second, despite the limited number of selected terms, they effectively encompass content from across the entire document, rather than solely the initial sentences. This suggests that this approach may be particularly intriguing for full document retrieval.
It is of clear scientific interest to revisit bag of words document representations, as they operate in a very different way in a neural ranking setting compared to their traditional use. A traditional ranker would consider only the query terms occurring in the document's content, resulting in a very sparse representation of the document's content. In contrast, a neural ranker would still consider all these selected characteristic terms best representing the entire content of the document. This could render neural models operating on only selected characteristic terms an appealing middle ground between traditional rankers and neural rankers fully encoding entire documents.

Fig. 1. Schematic of our proposed method: characteristic terms are first extracted from the document and then fed, in decreasing order of term importance, as a bag of words representation into the model. The input is shown for a Cross-Encoder Ranker.
It is of particular interest to revisit bag of words document representations in the context of efficiency. These document representations offer a direct way to control the model's input length by simply considering only the top n most characteristic terms. Rather than only observing the efficiency of a given model post-hoc, this gives us the agency to control the efficiency of the model. Assuming we can reduce the number of input tokens to our model, this will directly impact the GPU memory usage in a predictable way. Moreover, assuming we can lower the memory footprint of our model, this would allow us to run the same task with a larger batch size. Increasing the batch size will directly result in lowering per-query inference time by exploiting GPU acceleration. As a result, if successful, it would offer unprecedented control over efficiency.
In the course of the paper we will answer the following research questions:

RQ1 Can we represent long documents by a bag-of-characteristic-terms representation without a significant loss of retrieval effectiveness?
RQ2 What is the effect on efficiency when reducing the number of input tokens drastically?
RQ3 What is the efficiency-effectiveness trade-off of characteristic term document representations?
RQ4 How effective is ranking with characteristic terms out-of-domain?
RQ5 Is ranking with characteristic terms sensitive to position bias?
RQ6 Do we observe a similar efficiency-effectiveness trade-off for Dual-Encoders?
RQ7 Can we improve performance by fine-tuning on characteristic terms?

The rest of this paper is structured as follows. In Section 2, we embed our approach in related literature. Section 3 discusses why transformer-based rankers can be expected to cope with bag of words input. Next, in Section 4, we detail standard ways to create document representations based on salient or characteristic terms. Then, Section 5 explains the experimental setup. In separate sections, we explore the viability of ranking on term-based representations for long documents (Section 6), investigate the efficiency of extremely short document representations (Section 7), investigate the effectiveness-efficiency trade-off (Section 8), and test the robustness out-of-domain (Section 9). We provide further analysis in Section 10, exploring the sensitivity to position bias, the impact on Dual-Encoders, and fine-tuning the model on characteristic terms. We end with conclusions in Section 11.

RELATED WORK
The challenges of efficiently encoding long documents using transformers have attracted considerable interest both in machine learning [e.g., 50] and in information retrieval, in particular as part of the TREC Deep Learning Track [4, 5]. While prior research focused on novel models, this paper explores the impact of more efficient document representations by revisiting bag of words representations for neural rankers.
Many others have also made important contributions to increasing the efficiency of neural rankers. Earlier neural IR models [11, 15, 21, 54, 58] represent document terms with embeddings such as Word2Vec [35] and employ different pooling methods to aggregate input terms. Contextualized embeddings with transformer-based models have later proven to be very successful for passage ranking [5, 37, 38]. However, it is still an ongoing challenge to apply those models to long inputs [49]. Typically, existing passage ranking models are applied to all passages in the document, and the scores are aggregated [6, 20, 28, 55]. Taking the maximum passage (MaxP) turned out to be one of the most effective strategies [6, 55, 59]. Li et al. [28] find a more effective way of aggregating passages, at the cost of additional complexity. Hofstätter et al. [18] prune the number of passages to be evaluated, which still poses a significant computational overhead compared to our method.
Query term independence [36] allows document representations to be pre-computed offline using dual encoders and enables efficient search via an inverted index [7, 25, 33, 34]. The idea of extracting elite terms from documents to reduce the number of computations has previously been explored [7, 14, 25, 29, 33, 60]. Our work differs fundamentally from those approaches, as they first contextualize document terms via transformer modules and afterward score on a subset of tokens, still requiring the entire document to be passed through the model. In contrast, we feed only a fraction of the document to the model, saving GPU memory and inference time. Another line of research to mitigate the input limitation of transformers is to restrict self-attention to local windows [8, 46], approximate the self-attention matrix with a lower rank [27, 44, 48, 51], or use hybrid approaches [2, 53]. Our approach, instead, opts not to change the model but to look critically at the model's input.
This paper revisits earlier IR research on document representations. Classic term weighting approaches such as TF-IDF [42, 43, 45] are a standard way of term selection in many IR and Natural Language Processing (NLP) applications relying on vector-space models. In the language modeling framework, very specific "parsimonious" language models have been proposed [16]. Essentially, these models remove terms from the document language model that are already explained by the collection language model, only retaining the terms characteristic of the specific document. Hiemstra et al. [16] obtained very similar retrieval performance while restricting the language models to only a very small subset of characteristic terms. These models were inspired by the classic bell-shaped curve of Luhn [31] and were further extended into significant word language models removing both general and specific terms [9]. These significant word language models have been applied to relevance feedback [9] and hierarchical text classification [10]. The parsimonious language models are also more interpretable under human inspection [24], signaling that these models do retain all the characteristic features of a given document. In a way, our experiments will test whether this superiority in human inspection also translates to good ranking performance when encoded by a large neural model.

TRANSFORMERS AND NATURAL LANGUAGE SEQUENCES
Transformers are typically applied to text in natural sequence order. However, in this work, we propose a radically different approach, inspired by traditional retrievers, that disrupts the natural order of words and reduces a document to a bag of words representation. Equipped with the self-attention mechanism, transformers can attend to different parts of the input and contextualize words with each other. One might argue that the natural sentence structure, given by the order of the terms, is necessary to build meaningful contextual representations based on salient n-grams. We challenge this assumption about transformer models to understand what contributes to the success of the transformer models.

Earlier work applying transformers to ranking found that position information is less important for a fine-tuned ranker than for the pre-trained transformer [41]. They also find that removing position information results in a surprisingly small performance drop. By fine-tuning without position information, the performance gap is almost closed and the difference is minimal, leading to the conclusion that word order does not play a critical role in the success of the transformer-based ranker. In conclusion, this line of research suggests that transformer models can perform well on disrupted sequences. This holds particularly for the ranking task, which is largely based on finding similarities between tokens, rather than on extracting detailed knowledge to answer a question precisely, where the immediate context given by the natural word order is presumably of greater importance.
Besides the ability to model sequences, we believe transformers have fundamental properties that make them a suitable model architecture to operate on bag of words representations. A major strength of the self-attention mechanism is that it allows terms to be contextualized with each other over the entire input sequence. Independent of the sequence order, transformers can enrich the representations of related words by matching key-query pairs in the self-attention. Again, this is specifically important for the ranking task, which is heavily based on (soft-)similarity matching between the input tokens. Instead of latching onto particular n-grams, the model could instead focus primarily on term co-occurrences in the input to operate on term distributions. Other scholars [40] support this hypothesis by attributing the success of the pre-training of transformer models mainly to distributional information, regardless of the input order.
Inspired by these observations we ask the natural question: freed from document order, are all terms in the input equally important? If not, can we remove the unimportant terms? It is at least of scientific interest to investigate these questions, as they help elucidate and quantify the natural language factors responsible for the effectiveness of neural rankers. But this is not only of theoretical interest, as scaling transformers to long inputs such as full documents remains an open problem, and there are important efficiency benefits if we can encode full documents in a fraction of their original length. Hence, in this paper, we investigate those questions under efficiency aspects and provide empirical insights into the applicability of selective bag of words representations to transformer-based rankers.

DOCUMENT REPRESENTATIONS BASED ON CHARACTERISTIC TERMS
In this section, we will detail ways to create document representations based on salient or characteristic terms.
The main bottleneck for feeding long documents into transformer models is the limited input length due to the computational complexity of self-attention. The naive solution of fitting long documents into the model by truncating the excess tokens limits the model's context dramatically and does not generalize to all settings. It may work well for news or Wikipedia articles, which contain a short summary at the beginning. However, if a broader context is necessary to estimate whether a document is relevant, e.g., for legal documents, more advanced solutions are required. The solution we want to explore, instead of finding faster approximations of self-attention, is to leverage already existing ranking models.
To reduce documents, and therewith the number of input tokens, we can make use of a particular property of natural language: due to the Zipfian nature of natural language, a few terms occur exponentially more often than most of the other terms, and thus some terms are more discriminative than others. We propose to reduce full documents to meaningful terms only and to discard all non-informative terms in favor of a shorter input to the model. We call those Characteristic Terms (CT). More formally, given a document D in a collection C, we define a function f that assigns a term characteristic score s_i to each term t_i ∈ D = {t_1, ..., t_n}.
To obtain the input to our proposed method CT, we transform all documents using f and then sort the tokens by decreasing term characteristic score, so that the most characteristic terms are at the beginning. We then truncate the documents to fewer than 512 characteristic tokens. An example of the proposed pipeline for a Cross-Encoder ranker is illustrated in Figure 1.
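To make the pipeline concrete, the sketch below shows the reduction step for a generic scoring function f; the function and variable names are illustrative and not the exact code of our implementation, and the decision to keep each distinct term once (term frequency is dropped, as discussed later) is the behaviour described in this paper.

```python
from typing import Callable, Dict, List

def reduce_to_characteristic_terms(
    doc_tokens: List[str],
    score_fn: Callable[[str], float],  # term -> characteristic score (e.g., TF-IDF or PLM)
    max_terms: int = 512,
) -> List[str]:
    """Keep each distinct term once, sorted by decreasing characteristic score."""
    # Score each distinct term of the document with f.
    scores: Dict[str, float] = {t: score_fn(t) for t in set(doc_tokens)}
    # Sort terms by decreasing score and truncate to the top-n characteristic terms.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_terms]
```

The resulting term list is what is concatenated with the query and fed to the ranker (Figure 1).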
Information retrieval has a long history of estimating the importance of particular terms for the overall relevance of a document with respect to a query. We explore two different functions f to estimate term importance: the simple yet powerful TF-IDF (Section 4.1), as well as a more advanced method based on language models called PLM (Section 4.2).

TF-IDF
The most prominent proxy for term importance is "Term Frequency - Inverse Document Frequency" (TF-IDF) [23]. The TF-IDF of term t in a document d can be calculated as follows:

tfidf(t, d) = tf(t, d) · idf(t),

with tf(t, d) being the term frequency of term t in document d, and idf(t) being defined as

idf(t) = log( N / df(t) ),

where N is the total number of documents in the collection and df(t) is the document frequency of term t.
We use TF-IDF as a proxy for term selection. A high TF-IDF score indicates a characteristic term, while a low TF-IDF score indicates a non-informative term that does not contribute much to the relevance of a document.
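As an illustration, a minimal TF-IDF scorer over a tokenized collection might look as follows; it implements exactly the two formulas above, and any further details (e.g., the log base) are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def idf_table(collection: List[List[str]]) -> Dict[str, float]:
    """idf(t) = log(N / df(t)) over a tokenized collection."""
    n_docs = len(collection)
    df = Counter(t for doc in collection for t in set(doc))
    return {t: math.log(n_docs / df_t) for t, df_t in df.items()}

def tfidf_scores(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """tfidf(t, d) = tf(t, d) * idf(t) for every distinct term of the document."""
    tf = Counter(doc)
    return {t: tf_t * idf.get(t, 0.0) for t, tf_t in tf.items()}
```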

Parsimonious Language Model
Parsimonious Language Models (PLM), introduced by Hiemstra et al. [16], are an elegant way to estimate the importance of terms in a document with respect to a background collection using language models. The model gives more probability mass to terms that distinguish the document from the collection model, while the probability of terms that occur in the document as expected under the collection model is set to zero. Therefore, PLM not only estimates term importance but also reduces the document to its most discriminative terms.
More formally, information retrieval using language models defines a document model for each document. The language model defines a probability P(q_1, ..., q_n | D) for the sequence of n query terms q_1, ..., q_n. Usually, a mixture of the document model P(q_i | D) and a general collection model P(q_i | C) is used, weighted by a parameter λ which can be estimated through an Expectation-Maximization (EM) algorithm:

P(q_1, ..., q_n | D) = ∏_{i=1..n} ( (1 − λ) P(q_i | C) + λ P(q_i | D) ),

with typically the collection language model P(q_i | C) and the document language model P(q_i | D) estimated by maximum likelihood:

P(q_i | C) = cf(q_i) / Σ_t cf(t),    P(q_i | D) = tf(q_i, D) / Σ_t tf(t, D),

cf(q_i) being the collection frequency of the term, and tf(q_i, D) the term frequency of the term in the document D. Hiemstra et al. [16] reformulate the document language model smoothed by the collection model to shift probability mass in P(q | D) away from terms that are better explained by the general collection model. They estimate P(t | D) unsupervised using the following two EM steps until convergence:

E-step: e_t = tf(t, D) · λ P(t | D) / ( (1 − λ) P(t | C) + λ P(t | D) ),

M-step: P(t | D) = e_t / Σ_t e_t, i.e., normalize the model.

Similarly, we use PLM as a proxy to determine how characteristic a term is for a given document. A high score indicates an informative term, while low or zero scores indicate a term is not important for the given document.
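The EM procedure above translates almost directly into code. The sketch below is a minimal re-implementation of these two steps, not the wayward package used in our experiments; the fixed λ and the number of iterations are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

def parsimonious_lm(doc: List[str], p_collection: Dict[str, float],
                    lam: float = 0.1, iterations: int = 50) -> Dict[str, float]:
    """Estimate P(t|D) with the two EM steps of the parsimonious language model."""
    tf = Counter(doc)
    total = sum(tf.values())
    # Initialize P(t|D) with the maximum-likelihood estimate tf(t,D) / |D|.
    p_doc = {t: tf_t / total for t, tf_t in tf.items()}
    for _ in range(iterations):
        # E-step: e_t = tf(t,D) * lam*P(t|D) / ((1-lam)*P(t|C) + lam*P(t|D))
        e = {}
        for t, tf_t in tf.items():
            mix = (1 - lam) * p_collection.get(t, 0.0) + lam * p_doc[t]
            e[t] = tf_t * (lam * p_doc[t] / mix) if mix > 0 else 0.0
        # M-step: P(t|D) = e_t / sum_t e_t, i.e., normalize the model.
        norm = sum(e.values())
        p_doc = {t: e_t / norm for t, e_t in e.items()}
    return p_doc  # terms better explained by the collection end up with (near-)zero mass
```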
This section discussed two ways to determine salient or characteristic terms per document: TF-IDF and PLM. Table 2 shows an example of the dramatic reduction of a full document to 32 input tokens.

EXPERIMENTAL SETUP
In this section, we explain the general setup of our experiments. We perform our experiments using PyTorch [39]. Our models are based on the Hugging Face library [52].

Ranking Models
For our experiments we consider the two main classes of transformer-based rankers: Cross-Encoders and Dual-Encoders.

Cross-Encoder Ranker.
For most of the experiments we utilize a Cross-Encoder, introduced by [37], encoding both query and document at the same time. The input to the Cross-Encoder is the following:

[CLS] q_1 ... q_n [SEP] d_1 ... d_m [SEP],

where q_i ∈ Q represents the query tokens and d_i ∈ D the document tokens. The tokens representing the document are typically fragments of the document in natural language; in our experiments they are a bag of words representation of the most characteristic terms. For ranking, the activations of the CLS token of the last layer are fed to a binary classifier layer to classify a passage as relevant or non-relevant; the relevance probability is then used as a relevance score s_i to re-rank the passages.
On the one hand, the Cross-Encoder holds the potential for more nuanced matching by contextualizing query and document terms with each other. On the other hand, this means document representations cannot be pre-computed and all computation has to be performed at inference time. This makes Cross-Encoders computationally expensive, calling for ways to increase efficiency.
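As a minimal sketch of this scoring step, the snippet below builds the [CLS] query [SEP] document [SEP] input and scores it with the checkpoint named later under Checkpoints & Training; the query and term strings are illustrative, and the single output logit is used directly as the score (a sigmoid would turn it into the relevance probability described above).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint named later in the Checkpoints & Training section.
name = "cross-encoder/ms-marco-MiniLM-L-12-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def score(query: str, doc_text: str) -> float:
    """Relevance score for one query-document pair (doc_text may be a natural-language
    passage or the space-joined bag of characteristic terms)."""
    enc = tokenizer(query, doc_text, truncation=True, max_length=512, return_tensors="pt")
    with torch.inference_mode():
        return model(**enc).logits.squeeze().item()

# Characteristic terms are joined in decreasing order of importance.
s = score("what are natural amphetamines",
          "amphetamines natural effects ephedra guarana amphetamine")
```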

Dual-Encoder Ranker.
In Section 10.2 we additionally utilize a Dual-Encoder to rank on our bag of words representations. Dual-Encoders encode queries and documents independently from each other. Accordingly, the terms representing the document d_i ∈ D and the query q_i ∈ Q are fed separately into the model, by constructing the inputs as follows:

[CLS] q_1 ... q_n [SEP]    and    [CLS] d_1 ... d_m [SEP],

resulting in hidden representations h_Q ∈ R^{1×768} for the query and h_D ∈ R^{1×768} for the document. We use the dot-product to measure the similarity between the document and the query:

s = h_Q · h_D^T.

Document representations for the Dual-Encoder can be pre-computed, leaving the major computational weight offline. During inference, only the query has to be encoded. The matching between the query and the document reverts to a simple dot-product between the hidden representations of the CLS token at the final layer of the query and the document. The quadratic complexity of the self-attention mechanism with respect to the input length limits both the Cross- and Dual-Encoder to an input length of typically 512 sub-word tokens for the BERT model.
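A corresponding minimal sketch of this offline/online split is given below; the bert-base-uncased checkpoint and the example strings are placeholders (Section 10.2 uses a SPLADE checkpoint with its own pooling), so this only illustrates the CLS dot-product matching described above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-uncased" is a placeholder encoder for illustration.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name).eval()

def encode(text: str) -> torch.Tensor:
    """CLS representation h in R^{1x768} for a query or a (reduced) document."""
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.inference_mode():
        return encoder(**enc).last_hidden_state[:, 0]  # CLS token of the last layer

# Documents (their characteristic terms) are encoded offline; at query time only
# the query is encoded and matched by a dot-product.
h_d = encode("amphetamines natural effects ephedra guarana amphetamine")
h_q = encode("what are natural amphetamines")
s = (h_q @ h_d.T).item()
```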

Baselines
We compare our method of ranking on bag-of-characteristic-word representations with BM25 and state-of-the-art efficient document Cross-Encoders which we will detail in the following.

BM25.
We compare our method against the traditional retrieval method BM25. Similar to our proposed method, it operates on a bag of words representation; however, it bases the ranking only on lexical matches. We run the BM25 implementation of Anserini [56] via its Python interface with default parameters. We do not apply stemming in order to have comparable inputs to the neural models.

BERT First Passage (FirstP).
BERT FirstP serves as a passage-based document model baseline. Truncating documents to the maximum input length is a common practice to fit documents into a BERT passage ranker [6, 55, 59]. Basing the prediction on only the first passage of the document is a pragmatic solution; however, as demonstrated later, it is not a generally valid approach to capture the content of an entire document.

BERT Random Passage (RandP).
As a more realistic passage-based document ranking baseline that does not make any particular assumption about the position of the passage in the document, we introduce a baseline that picks a random passage within the document. This represents the general truncation case better than FirstP alone, as it cannot exploit artifacts such as documents starting with a descriptive summary of the entire document (title and lead paragraph in Wikipedia). Further, limiting RandP and CT to the same number of tokens allows for assessing the quality of encoding an entire document by means of characteristic terms in direct comparison, and RandP serves as our main baseline with limited input. We average scores over five runs with different random passages, to obtain meaningful results that are not biased by a particular (un)lucky choice of a document fragment.

BERT Maximum Passage (MaxP).
As a document-level baseline, we use the best-scoring passage (MaxP) document ranking approach, which leverages the score of the best-scoring passage, as first studied by Dai and Callan [6] and widely adopted since [55, 59]. For MaxP we apply a sliding window with a length of 512 and a stride of 256. We also tried a smaller window of 150 with a stride of 75, which performed worse. When referring to maximum input lengths for the CE model we refer to the entire input, which spans both query and document.
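For reference, the MaxP strategy can be sketched as follows, reusing the Cross-Encoder score() function from the earlier sketch and the window/stride stated above; the word-level windowing is a simplification of the sub-word tokenization actually used.

```python
from typing import List

def maxp_score(query: str, doc_tokens: List[str],
               window: int = 512, stride: int = 256) -> float:
    """Score all overlapping passages with the Cross-Encoder and keep the maximum."""
    passages, start = [], 0
    while True:
        passages.append(" ".join(doc_tokens[start:start + window]))
        if start + window >= len(doc_tokens):
            break
        start += stride
    # score() is the Cross-Encoder relevance function from the earlier sketch.
    return max(score(query, passage) for passage in passages)
```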

Longformer(-QA).
Longformer [2] is a long sequence transformer model optimized to lower the quadratic complexity of self-attention. The model is a variant of the Sparse Transformer [3], which does not attend from each token to all other tokens but reduces complexity by attending only to a subsection of the input through strided patterns. Additionally, it adopts global memory tokens that give the model access to the entire input sequence for classification. Longformer-QA [2] is additionally fine-tuned on QA, leading to potential benefits for the ranking task considered in this paper.

Big Bird.
Big Bird [57] is another variant of a long sequence transformer model. It reduces the quadratic complexity of the self-attention to linear complexity by employing global attention tokens and random attention patterns combined with fixed patterns (local sliding window).

IDCM.
The key idea behind IDCM [18] is to employ an efficient ranking model to select a handful of promising candidate passages. Then a more complex ranker is used to score these remaining candidate passages. This can be seen as an approximation of MaxP, reducing the complexity by scoring only a small subset of all passages with the more complex model. Note that while for MaxP, FirstP, and CT we employ a BERT model with 12 layers, IDCM utilizes a distilled BERT model with only 6 layers, making it an even harder baseline to beat in terms of efficiency.
We refrain from comparing to even more complex models such as PARADE [28] and QDS-Transformer [22], as we focus on the efficiency aspects of document models in this paper.

Checkpoints & Training
All BERT-based passage document models used in this paper (BERT MaxP, FirstP, RandP, and CT) utilize the checkpoint cross-encoder/ms-marco-MiniLM-L-12-v2, which is available on Hugging Face. The maximum number of input tokens including the query for BERT FirstP, RandP, and CT is set to 512 tokens.
Checkpoints of the document rankers Big Bird, Longformer, and Longformer-QA are not publicly available. Training long-input models is time- and resource-consuming due to their large computational footprint, and it is challenging, especially in a more complex setup such as the teacher-student distillation setup that our passage-based document model was trained with. It remains unclear which training strategy is optimal for document ranking. We therefore revert to a simple training setup. We fine-tuned them using the Adam optimizer [26], batch size 64, warm-up steps 1000, learning rate 3e-6, epoch size 1000, and evaluate the best model within 15 epochs. To generate document training data we use the script "msmarco-doctriples.py" together with the official document training queries and relevance judgments provided by the TREC organizers. We modify the script to generate three training triples per relevant query-document pair (instead of one), randomly selecting documents from the provided top-100 ranking, yielding roughly 1M training triples.
For the following experiments, we use the model zero-shot with regard to input manipulations and do not fine-tune the model further, except as indicated differently for Section 9.

MS MARCO.
We base our in-domain evaluation on the TREC 2020 Deep Learning Track's document retrieval task on the MS MARCO dataset [1]. The documents are high-quality extractions of real web pages and are on average 3,266 sub-word tokens long.

Robust04.
The TREC 2004 Robust Track collection consists of news articles. Robust04 uses the collections of TREC-6, 7, and 8 and (partly) reuses topics that scored low in those earlier TREC years. This test collection is therefore particularly built to pick the hardest topics from earlier TREC years. The average length of the documents in this collection is around 549 sub-word tokens. BM25 with no stemming serves as a first-stage ranker to retrieve the top-100 ranks for each topic.
Collection statistics of both collections can be found in Table 3.

Evaluation
For a fair comparison, we re-rank the same top 100 documents retrieved by BM25 with all models.
In-domain evaluation of MS MARCO is done using the NIST 2020 judgments. For out-of-domain evaluation on Robust04, we use the title queries 301-450 and 601-700. We evaluate using the metrics NDCG@10, MAP, and Recall@30. We chose Recall@30 as we are only re-ranking 100 documents, and this score gives an indication of the impact on recall (in particular for Robust04).

EFFECTIVENESS OF DOCUMENT REDUCTION
In this section, we want to explore the impact on effectiveness when reducing documents to only characteristic terms. To this end, we study our first research question:

RQ1 Can we represent long documents by a bag-of-characteristic-terms representation without a significant loss of retrieval effectiveness?
We are seeking a reduction of the documents to the maximum input token length of 512 or less, to be able to leverage the passage ranker to retrieve documents. We achieve this by revisiting bag of words document representations, in which we aim to select only the most meaningful terms and disregard all others. Reducing the input this way offers a straightforward way to carry out single-pass document retrieval, already making an important step towards increasing efficiency. A performance decrease when reducing the number of document tokens feels intuitively unavoidable, as the model receives less content from the document.

Experiment Design
We select characteristic terms using two different methods following Section 4: TF-IDF and Parsimonious Language Models (PLM). For PLM we use the publicly available PLM implementation wayward and found the default parameters with λ = 0.1 to work well, but we do not claim those parameters to be optimal. For both experiments we generate a maximum of 512 characteristic words per document. We control the number of characteristic terms per document by feeding the top-n characteristic terms into the model with n = {512, 256, 128, 64, 32}. Note that term selection for both TF-IDF and PLM is query-independent, can be computed offline, and comprises highly parallelizable computations. In this section, we only evaluate the models on the MS MARCO collection, as we investigate the in-domain performance of our method.

Results and Discussion
The results of feeding different numbers of characteristic terms estimated using TF-IDF and PLM to the BERT Model can be found in Figure 2.
First, in line with our expectation, when feeding small, random passage fragments (RandP) to the model, performance deteriorates strongly. Generally, the performance of RandP decreases in moderate steps when reducing the documents to 512 and 256 tokens, respectively. For shorter inputs, the performance deteriorates to very low values. The decline clearly shows a convex curve, hinting at a favorable trade-off between potential efficiency gains and effectiveness losses. However, the performance drop for extremely short truncated document representations will be too large for most application scenarios. Second, reducing documents to characteristic terms is effective if the terms are chosen in the right way. Estimating characteristic terms using PLM yields higher performance than using TF-IDF. Examples of reducing documents to only the 32 most salient tokens using the two methods can be found in Table 4. PLM seems to yield much cleaner term representations than TF-IDF, containing fewer non-informative symbols. This gives a possible explanation for why PLM performs consistently better than TF-IDF.

Table 4. Examples of the most salient tokens selected by TF-IDF and PLM for four example documents.
TF-IDF
1: ,.- amphetamines ' blackbrush guarana ephedra amphetamine cathinone natural : " "
2: ., - carnitine " cla & ? l acetyl fat ; supplement synthesize amino lean supplementation acid
3: , .- amino ? arginine acids proteinogen ornithine carnitine methionine continue taurine acid
4: ', arginine & ornithine amino citrulline acids of the ammonia blood lysine hormone milligrams
Parsimonious Language Model
1: amphetamines natural effects ephedra guarana amphetamine blackbrush processes produce as cathinone plant works marketed
2: carnitine l cla fat acetyl supplement acid & body lean amino essential meat while synthesize supplementation dairy mass
3: amino continue acids arginine acid l ornithine methionine carnitine important proteinogen glutamine
4: arginine l ornithine & amino acids blood growth benefits citrulline ammonia cycle hormone according pressure body lysine
Selecting characteristic terms using TF-IDF or PLM outperforms our baseline of picking random passages (RandP) on all measures across the different input lengths. Both methods receive the same number of document tokens; however, reducing the document to the most essential terms proves to be a valid strategy for ranking with a transformer-based model, even without specifically adapting the model to our bag of words representations. It is notoriously hard to understand why large pre-trained transformers result in effective rankers, and this observation also suggests one of the main aspects captured in these models. For CT, in contrast to RandP, we observe a remarkably stable performance across the different numbers of input tokens, even when reducing documents of several thousand sub-word tokens to only 32 tokens. For 64 and 32 input tokens we observe a small early-precision decrease in performance compared to CT 128. However, CT with 32 document tokens still outperforms CT 512. In this light, it is important to note that reducing the input to characteristic terms destroys the natural sequence order that the model is known to be highly sensitive to [32, 41]. It is all the more surprising that the Cross-Encoder Ranker obtains good performance on characteristic term bag of words representations where uninformative words are removed. We further would like to point out that feeding only characteristic terms leads to the loss of term frequency information, which is known to be a strong indicator of relevance.
Losing the term frequency in combination with receiving too many tokens might weaken the discriminative power of the characteristic terms, leading to a slight performance decrease for 512 in comparison with 256, 128, and 64 terms. This observed reduction for extremely short document representations is entirely expected, given that we reduce very long documents to only a handful of remaining tokens. We provided an example of an original document reduced to 32 tokens in Table 2 before, to give an impression of the level of reduction we are applying for CT. The Cross-Encoder Ranker seems to be able to achieve good performance nonetheless, and simply duplicating the characteristic terms only resulted in a performance decrease. Next, we want to put the results of our method in context with other retrieval models. We refer to Table 5 for the results. In this table, we compare our method CT to the naive baselines FirstP and RandP using a maximum input of 64 characteristic terms, as we found it to maintain high effectiveness while reducing the input to a minimum number of tokens (see Figure 2). We find CT to significantly outperform BM25 and RandP and to perform comparably (slightly lower NDCG@10, but higher MAP) to the very costly neural document models Big Bird and Longformer(-QA). MaxP and IDCM perform significantly better than CT, but with the caveat of considering a larger set of document tokens. Baseline BERT FirstP performs well on MS MARCO but is less effective on Robust04, as we will show later in Section 10.1, which can be attributed to the strong position bias in MS MARCO.
Regarding RQ1, we conclude that reducing documents to bag-of-word representations with characteristic terms can yield reasonable performance even with only a fraction of selected terms compared to the full document.Our proposed method achieves document retrieval in a single forward pass outperforming the truncated passage-based run.As we are using a ranker trained only on natural language text, this is a surprising finding that offers novel ways to build neural rankers.

EFFICIENCY OF EXTREME DOCUMENT REDUCTION
In this section, we study our second research question:

RQ2 What is the effect on efficiency when reducing the number of input tokens drastically?
After demonstrating that extreme document reduction to a bag of characteristic terms can be a viable strategy, we aim to quantify the theoretical speedup in practice by reducing the number of input tokens. Assuming we could reduce documents below the 512-token limit to a minimal number of representative tokens, how much more efficient would the model be? Given the computational complexity of the self-attention that scales quadratically with the number of input tokens, utilizing only a fraction of the input will translate into faster query latency as well as lower GPU memory usage. The efficiency therewith solely depends on the number of tokens fed into the model, which can be fixed at any length; the resulting gains therefore hold in general, regardless of specific collections. How does this work out in practice?

Experiment Design
To quantify the efficiency we measure query latency, which describes the time needed for the contextualization of a single query with the top-n documents. Note that selecting the characteristic terms of a document does not impact inference time, as the documents can be pre-processed offline and the selection is query-independent. In addition, we measure the maximum amount of GPU memory used. Analyzing the peak GPU memory is particularly interesting as the number of query-passage pairs that can be contextualized in one batch is limited by the amount of available GPU memory. Consequently, a lower GPU memory footprint allows for a higher throughput per forward pass.
Depending on the application scenario this can be leveraged in different ways without requiring additional GPU hardware. First, the number of documents to be re-ranked per query request can be increased without increasing run-time. This is particularly interesting when a classic first-stage ranker acts as a gatekeeper and having more documents in the re-ranking stage can lead to further improvement [19, 20]. Second, a multitude of query-document pairs can be contextualized in one batch at the same time. For example, in an operational setting, the deployed model can serve more search requests at the same time. Third, a multitude of documents can be encoded within one batch, which can lead to big speed-ups with GPU acceleration, e.g., for indexing a collection using a Dual-Encoder. We will focus on demonstrating the two latter points in this paper.
To be precise, for estimating the query latency we measure the contextualization of one query-passage pair based on a batch and multiply it by the number of documents to be re-ranked. We carry out all our efficiency experiments on a single NVIDIA V100 with 16 GB memory with the maximum batch size in PyTorch inference mode. We determine the maximum batch size by increasing the batch size until we run out of GPU memory. We discard the first warm-up batch from the measurement. In general, we measure the bare forward pass and do not include pre-processing or disk-access times, to be more comparable to other experimental setups.
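A sketch of how such measurements can be taken is shown below, using PyTorch's peak-memory counters and wall-clock timing around the bare forward pass; the exact instrumentation we used may differ in details, and the function names are illustrative.

```python
import time
import torch

def measure(model, batch, n_rerank: int = 100, warmup: int = 1):
    """Peak GPU memory (GB) and per-query latency (ms) for re-ranking n_rerank documents."""
    model = model.cuda().eval()
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.inference_mode():
        for _ in range(warmup):              # discard the first warm-up batch
            model(**batch)
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(**batch)                        # bare forward pass only
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    batch_size = next(iter(batch.values())).shape[0]
    # Latency per query-passage pair, scaled to the number of re-ranked documents.
    latency_ms = elapsed / batch_size * n_rerank * 1000
    return peak_gb, latency_ms
```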

Results
We will now report query latency and maximum GPU memory for a decreasing number of input tokens for MS MARCO. First, in Table 6 we compare the efficiency using a fixed batch size of 32. This is the maximum batch size that fits on the GPU with 512 input tokens for all models. GPU memory decreases sharply with input length: by reducing the input length from 512 to 32 tokens, the GPU memory decreases by 86% (from 1.11 GB to 0.15 GB). Query latency also decreases with shorter inputs, by up to 8.4% with 32 input tokens compared to the full 512 input tokens.
Second, in Table 7, we demonstrate how the reduced GPU memory footprint of our method can be leveraged to increase the batch size and thus achieve faster inference / higher throughput. The table shows how the batch size can be dramatically increased for shorter inputs. Given the same GPU memory constraint, we can increase the batch size from 256 documents encoding 512 tokens to no less than 10,752 documents encoding 32 tokens. Considering the query latency in Table 7, we can observe a quadratic decrease with decreasing input length. By reducing the input length from 512 to 256 tokens, the query latency decreases by a factor of 4 (from 4.60 ms to 1.02 ms). When reducing the input to 32 tokens, we achieve a speed-up of 35 times compared to the full input of 512 tokens.
Regarding RQ2, we conclude that dramatic efficiency gains in GPU memory and inference time can be observed when reducing the number of input tokens below 512. Revisiting bag of words document representations in the context of neural rankers allows us to explore the entire space between i) more effective, but also far less efficient models, and ii) far more efficient, but possibly less effective, models. In practical applications, dramatic gains in efficiency may still be a favorable cost-benefit trade-off even if they come with some reduction in effectiveness.
More generally, we report the efficiency results separately, as they do not depend on the specific corpus or test collection. That is, the observed efficiency gains hold in general for the specified number of input tokens, regardless of the collection. Moreover, the observed trends also apply to other classes of transformer-based rankers, making the reduction of input length a generally applicable strategy. For any concrete application, we will be interested in the trade-off between efficiency and effectiveness for the specific case at hand.

EFFICIENCY-EFFECTIVENESS TRADE-OFF OF EXTREME DOCUMENT REDUCTION
In this section, we study our third research question:
RQ3 What is the efficiency-effectiveness trade-off of characteristic term document representations?

After having studied the effectiveness and efficiency gains in isolation for short inputs, we want to understand how they relate to each other and compare the efficiency-effectiveness trade-off to other document ranking models. Rather than focusing only on the most effective conditions, we would like to understand the underlying trade-offs: in any real-world application scenario, we have to trade off efficiency (and cost) against effectiveness. We use the same experimental setup as detailed in the previous two sections.

Results
We compare the effectiveness-efficiency trade-off of our proposed method of reducing documents to characteristic terms (CT) in Figure 3. For the sake of comprehensibility, we exclude BERT FirstP and BERT RandP, as they exhibit the same efficiency as CT but at much lower effectiveness, as shown in Section 6. Figure 3(a) shows the effectiveness in NDCG@10 against maximum GPU memory for different numbers of input tokens for batch size 32. Figure 3(b) depicts the effectiveness in NDCG@10 against the query latency with the maximum batch size that can be fit onto the GPU for the respective input length. Generally, we observe that the passage-based document models achieve higher performance while being orders of magnitude more efficient than the costly document models Longformer(-QA) and Big Bird. While the IDCM model reduces computation by scoring only three passages with the costly transformer-based model (query latency 89 ms for the core model components), the additional processing overhead required to score and select passages using a cheaper model still leads to a total query latency of 307 ms. Our method CT achieves very strong efficiency gains (as laid out in Section 7) at very favorable effectiveness levels.
Being able to provide our model with different numbers of characteristic terms allows us to make deliberate choices with respect to efficiency. We can choose to trade off efficiency against effectiveness. Figures 3(a) and 3(b) show that while reducing the input to 32 tokens is most efficient with respect to GPU memory and query latency, this setting might be too extreme as performance is compromised. Choosing 512 characteristic terms might be at the other end of the extreme, leading to lower performance (as explained in Section 6) and lower efficiency. This leaves input sizes 64, 128, and 256 as viable options to be used in practice, depending on the application and requirements.
A summarizing table detailing the computational costs and performance can be found in Table 8. If GPU memory is of concern, CT can achieve impressive performance even with a very small amount of available GPU memory (see Table 8, CT-Memory). If query latency is of concern and a larger amount of GPU memory is available, query latency can be reduced drastically to around 1 ms (see Table 8, CT-Latency). In Table 8 we report the efficiency-effectiveness of the most effective CT model to provide one concrete example: CT with 256 tokens achieves around 90% of the performance in terms of NDCG@10 compared to the most effective baseline BERT MaxP, while reducing either GPU memory by 62% (CT-M) or query latency by around 99.9% (CT-L), leveraging the maximum batch size that fits onto the GPU and therewith increasing throughput.
A further beneficial effect of encoding the documents in characteristic terms is that the collection size of MS MARCO on hard disk is reduced (see Table 9). For instance, by committing to 64 characteristic terms, storage is reduced by 93%, from 22 GB to 1.68 GB.
Regarding RQ3, we conclude that using characteristic terms to reduce documents gives a handle to trade off efficiency, in terms of GPU memory and inference time, against effectiveness. While being able to increase efficiency by orders of magnitude, we demonstrated that we can still retain high levels of effectiveness.

OUT-OF-DOMAIN EFFECTIVENESS FOR CT
The efficiency results of reducing the document to a specific number of tokens are general and transfer to any collection; however, the effectiveness results are not and depend on the collection. In this section, we seek to understand whether our effectiveness results generalize to another collection. We answer this by asking the following research question:

RQ4 How effective is ranking with characteristic terms out-of-domain?
While current state-of-the-art ranking methods consider documents in natural language, CT takes a fundamentally different approach, representing documents as bag of words representations. Does this allow the model to generalize better to other collections?

Experiment Design
We probe the effectiveness of our proposed method CT out-of-domain on the news collection Robust04. The model has therefore never seen any of the documents and is applied zero-shot. Similarly to the previous experiments, we first run PLM to estimate the importance of each term in the document and then feed the n most characteristic terms to the Cross-Encoder BERT ranker with n = {512, 256, 128, 64, 32}. All other neural models receive the documents in full narrative natural language.

Results
The out-of-domain Robust04 ranking results for BERT CT and baseline RandP for different numbers of characteristic terms can be found in Figure 4. We refer to Table 10 for detailed numbers. Baseline BERT RandP behaves similarly to how it does on MS MARCO, losing substantial ranking quality with smaller fragments of the documents as input. CT again maintains strong performance even for an extremely small set of characteristic terms. We find CT to significantly outperform all other document ranking methods except MaxP on NDCG@10, while being many times more efficient than those methods. The low results of BERT FirstP highlight clearly that truncating documents to their first passage is not a generally valid solution to approximate document ranking and only works on collections that contain most of the relevant information at the beginning. We refer to Figure 6 for a direct comparison between BERT CT, RandP, and FirstP. While the strategy of BERT MaxP of exhaustively scoring all overlapping passages without making any assumptions about the document seems to be much more robust to shifts in the collection, all other neural document models seem to struggle, performing below BM25 on both metrics. IDCM performed best within the training collection; however, given the low zero-shot performance, it seems to have latched on to artifacts (such as position bias in MS MARCO [17]) that do not allow for generalization to other collections. We suspect similar effects apply to Big Bird and Longformer(-QA). Again, the efficiency results of CT are independent of the collection and hold in general, thus also on Robust04.
Regarding RQ4, we conclude that CT is robust to collection shifts, exhibiting strong zero-shot performance compared to other document ranking methods. This can be explained by the bag of words representation abstracting away from linguistic aspects of the documents such as particular writing styles, text complexity, or collection artifacts.

ANALYSIS
In this section, we provide additional analysis that allows us to understand ranking on bag-of-characteristic-word representations better. We first analyse the position bias effect of the used collections by comparing first-paragraph and random-paragraph truncation. Second, we explore whether our effectiveness results translate to Dual-Encoders in a similar way. Third, we examine whether fine-tuning on our bag of words representations can allow the model to make use of this particular input more optimally.

Comparison of RandP, FirstP, and CT
Previous work by Hofstätter et al. [17] has found MS MARCO to exhibit a strong position bias towards containing relevant information at the beginning of the documents. Such position bias may not only lead to biased, degenerate models that learn to excessively focus on the beginning of the documents, but also allows building simple methods that exploit such bias. For instance, it allows reducing a document to only its beginning instead of considering the entire document. In this subsection, we examine how ranking on a bag of words representation might avoid such biases by posing the following research question:

RQ5 Is ranking with characteristic terms sensitive to position bias?
We analyze this by comparing how approaches with and without position bias by design generalize to out-of-domain data. Specifically, we make a direct comparison between the input strategies of taking the first passage (BERT FirstP), selecting a random passage in the document (RandP), and reducing the document to characteristic terms using PLM (BERT CT-PLM). We show the effectiveness of those strategies for varying input lengths in terms of NDCG@10, MAP, and R@30. Figure 5 shows the performance of the different input strategies in-domain on MS MARCO, while Figure 6 shows how the methods perform out-of-domain on Robust04.
We observe that FirstP indeed performs relatively well on MS MARCO, even for extremely short inputs. FirstP performs comparably to CT-PLM in terms of NDCG@10, but for the remaining metrics CT-PLM performs stronger. Comparing the performance of FirstP on MS MARCO with Robust04, which has less position bias, it can be noted that the performance is much lower. On Robust04, FirstP declines sharply with fewer input tokens, performing close to picking a random passage (RandP). This confirms again that MS MARCO is biased towards containing important terms at the beginning of the document [17]. As a result, experiments on MS MARCO tend to overestimate the effectiveness of FirstP as a retrieval approach. In contrast, RandP, which ranks on random passages of the document, performs consistently across collections and may be a more representative baseline, being invariant to any position bias in the documents. CT-PLM also exhibits robust performance both within-domain and out-of-domain. More generally, considering the performance drops of FirstP and RandP, we conclude that simple truncation is not a generally viable strategy for document ranking.
To answer RQ5, by forming a bag of words representation of characteristic terms selected over the entire document, our method is not susceptible to any kind of position bias.

Dual-Encoders on Characteristic Terms
In this section, we examine whether the findings in Cross-Encoders also hold for Dual-Encoders.
In contrast to Dual-Encoders, Cross-Encoders receive both query and document simultaneously as input, which does not allow document representations to be pre-computed offline. Despite sharing the same underlying transformer-based model, the matching mechanism differs fundamentally: Dual-Encoders encode documents and queries independently, whereas Cross-Encoder computations are query-dependent and thus inherently costly to employ. Therefore, the question of how to increase the efficiency of the Cross-Encoder, as studied in this paper so far, is of the greatest concern. The efficiency of Dual-Encoders is less critical, but it is nonetheless important to understand whether our results translate to Dual-Encoders as well. To shed light on ranking on bag of words representations using Dual-Encoders we study the following research question:

RQ6 Do we observe a similar efficiency-effectiveness trade-off for Dual-Encoders?
Dual-Encoders are typically applied as first-stage rankers in a multi-stage ranking setup. In this role, they are applied to every document in the collection, potentially spanning millions of documents. Therefore, any result that translates from the previous sections on Cross-Encoders to Dual-Encoders is of practical relevance, additionally saving time, energy, and resources.
We repeat the experiment of the previous sections of feeding characteristic terms, but this time using the distilled Dual-Encoder SPLADE [13]. To measure efficiency we again follow Section 7.1. This model is available on HuggingFace under "naver/splade-cocondenser-ensembledistil". We compare the performance of CT to applying the SPLADE Dual-Encoder with the MaxP strategy as a document model, using the maximum batch size that fits onto the GPU for each model. As previously shown, MaxP serves as a very strong baseline in terms of efficiency and effectiveness.
We report combined efficiency and effectiveness in Table 11. We find a similar efficiency-effectiveness trade-off as for Cross-Encoders, outperforming BM25 even for the smallest number of characteristic terms. We also report the time needed to encode the entire MS MARCO document collection V1 (≈3.2M documents) and find that reducing documents to characteristic terms has a serious impact on the encoding time, reducing it dramatically from 58h 11m (Dual MaxP) to 7m 15s (Dual CT) when using the maximum batch size possible. At the same time, it preserves 90% of the performance in terms of NDCG@10 and 93% in terms of MAP.
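For completeness, a minimal sketch of batch-encoding reduced documents with this SPLADE checkpoint is given below; the log-saturated max pooling shown is the standard SPLADE formulation, and the maximum length, batch composition, and example strings are assumptions for illustration rather than our exact encoding settings.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def splade_encode(texts, max_length=64):
    """Sparse term-weight vectors for a batch of (reduced) documents or queries."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.inference_mode():
        logits = model(**batch).logits               # (batch, seq_len, vocab)
    # Standard SPLADE pooling: log(1 + ReLU(logits)), max over token positions.
    weights = torch.log1p(torch.relu(logits)) * batch["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values                 # (batch, vocab)

# Documents reduced to a few characteristic terms can be encoded in large batches offline;
# ranking is the dot-product between query and document vectors.
docs = splade_encode(["amphetamines natural effects ephedra guarana",
                      "carnitine cla fat acetyl supplement"])
query = splade_encode(["what are natural amphetamines"])
scores = query @ docs.T
```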
We conclude, regarding RQ6, that we can find a similarly beneficial trade-off between efficiency and effectiveness for Dual-Encoders as for Cross-Encoders. This result shows the validity, in terms of efficiency and effectiveness, of our proposed method of reducing documents to characteristic terms across the two predominant model classes for neural ranking with transformer-based models.

Fine-Tuning on Characteristic Terms
In the previous experiments, we have shown that transformer-based rankers, even though never exposed to such bag of words representations, can achieve comparable performance to models that were trained on full natural language.
In this section, we study our seventh research question:

RQ7 Can we improve performance by fine-tuning on characteristic terms?
All transformer-based rankers we used were pre-trained and later fine-tuned on the ranking task in natural language. In contrast, when feeding characteristic terms to the models, we transform a document from fluent text into a bag of words representation that lacks natural order entirely. Adapting the model to our bag of words input might allow the model to improve performance in several ways. First, as mentioned earlier, by reducing the document to only characteristic terms, we lose term frequency, which might have a negative impact on performance and leaves room for improvement. Second, the model might be able to leverage the fact that we feed the terms by decreasing term importance, and therefore put more focus on the initial document terms in the input.
The official training triples for MS MARCO are given for passages only. To generate document training data we use the script "msmarco-doctriples.py" provided by the TREC organizers together with the document training queries and respective relevance judgments. We modify the script to generate three training triples per relevant query-document pair by randomly selecting documents from the top-100, yielding roughly 1M triples. We reduce the documents in a similar way to Section 6.1, to a maximum of 512 characteristic terms using PLM. In this experiment, we choose the maximum number of characteristic terms to be 512, as we do not want to restrict the model to any particular choice of tokens; it should instead learn during fine-tuning which tokens are important and which to ignore.
To understand the contribution of the term importance that we convey to the model by sorting the input tokens, we conduct another experiment where, instead of sorting the characteristic terms by term importance, we keep them in their natural order according to the original document. This way, terms that appear close to each other in the original text stay close to each other in the input, which might allow the model to contextualize terms in a more natural way. A sketch of the two input constructions follows below.
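The following is a minimal sketch of the two input constructions, assuming per-term importance scores (here a hypothetical term_scores mapping, e.g., from TF-IDF or PLM) have already been computed; it illustrates the two orderings rather than our exact implementation.

    def reduce_to_characteristic_terms(doc_tokens, term_scores, k=512, sort_by_importance=True):
        """Keep the k most characteristic terms of a tokenized document.

        doc_tokens: document tokens in their original order.
        term_scores: hypothetical mapping from term to importance score.
        """
        # select the k distinct terms with the highest importance scores
        ranked = sorted(set(doc_tokens), key=lambda t: term_scores.get(t, 0.0), reverse=True)
        top_terms = set(ranked[:k])
        if sort_by_importance:
            # variant 1: feed terms by decreasing importance (our default CT input)
            return ranked[:k]
        # variant 2: keep the selected terms in their natural document order
        seen, ordered = set(), []
        for t in doc_tokens:
            if t in top_terms and t not in seen:
                seen.add(t)
                ordered.append(t)
        return ordered

    # e.g., reduce_to_characteristic_terms("a combine harvester harvests grain crops".split(),
    #                                      {"harvester": 9.1, "combine": 7.4, "grain": 6.8}, k=3)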
For fine-tuning the models on characteristic terms, we take the pre-trained bert-base-uncased checkpoint provided on HuggingFace as a basis. We follow the training scheme of Nogueira and Cho [37]. For fine-tuning, we use batch size 64, maximum sequence length 512, 1000 warm-up steps, learning rate 3e-6, epoch size 1000, and evaluate the best model within 40 epochs.
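A minimal sketch of such a fine-tuning setup with the stated hyperparameters is shown below, assuming the common monoBERT-style pointwise classification formulation of Nogueira and Cho; the toy training batch is a hypothetical placeholder and not our actual training data or code.

    import torch
    from transformers import (BertForSequenceClassification, BertTokenizerFast,
                              get_linear_schedule_with_warmup)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).train()

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=1000,
                                                num_training_steps=40 * 1000)

    # Hypothetical toy batch: (queries, documents reduced to characteristic terms, labels).
    train_batches = [(["what amino produces carnitine"],
                      ["carnitine lysine methionine amino acid synthesis liver"],
                      [1])]

    for queries, ct_docs, labels in train_batches:
        batch = tokenizer(queries, ct_docs, padding=True, truncation=True,
                          max_length=512, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor(labels)).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()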
The results can be found in Table 12. We observe that fine-tuning the model on characteristic terms in their natural order results in much lower performance than sorting the characteristic terms by term importance. For the latter, we observe a significant performance increase on all measures over the zero-shot ranker. These results confirm our expectation that, without term frequency, it is much harder for the model to estimate the importance of individual terms, and they demonstrate the importance of the repurposed position embeddings, which convey term importance through the input order, for this experiment. Compared to the computationally intensive baseline MaxP, which applies a sliding window to capture the context of the entire document, fine-tuned CT gets close in performance with a single forward pass. We even outperform the even more costly document baselines Big Bird and Longformer(-QA).
To answer RQ7: fine-tuning the model on characteristic terms can improve performance compared to applying the model zero-shot.

CONCLUSIONS
This paper explored a different approach to the efficiency of neural rankers by focusing on the input of the model rather than changing the model. We investigated the effectiveness and efficiency of bag of words document representations based on characteristic terms for document retrieval on two collections, MS MARCO and Robust04. We presented the surprising finding that transformer-based rankers trained only on natural language still achieve competitive performance on carefully constructed bag of words representations without being specifically adapted to this input. In our experiments, we observed a favorable trade-off between efficiency and effectiveness: inference time can be sped up dramatically while compromising performance only minimally compared to passage-level ranking approaches. This holds for both in-domain (MS MARCO) and out-of-domain (zero-shot Robust04) document retrieval tasks. We also found that fine-tuning the model on characteristic terms can improve performance to be almost on par with models that are an order of magnitude less efficient in terms of query latency and GPU memory requirements.
Our findings suggest new ways to make transformer-based rankers more efficient in the future, by exploring optimal document representations that use only a fraction of the input length. Furthermore, by forming bag of words representations, our proposed method mitigates any kind of position bias that could occur in the original input documents. Position bias in long documents is not only problematic under distribution shifts between training and test data; LLMs themselves have difficulty making use of information in long input contexts [30], which requires further mitigation [17,47]. In this work, we leveraged existing methods to select characteristic terms, which might not be optimal for transformer-based rankers. We leave it to future work to investigate better ways of estimating term importance. One of the key benefits of this research direction is that it changes only the input of the model while deploying the exact same ranking model. This allows us to vary what we feed into the model, even at request time, depending on the case at hand or the available resources. For example, one could opt to use longer input for important queries, for final-stage ranking, or for downstream NLP processing. This gives unprecedented control over the efficiency of the resulting ranker, allowing practitioners to choose a desirable trade-off between effectiveness and efficiency.
More generally, our work is not motivated by chasing the state of the art, but by conceptual and analytic experiments that contribute to our scientific understanding of transformers for text ranking and of which aspects drive their effectiveness. Such understanding is crucial both for connecting current neural IR models to our understanding of traditional IR models, and for suggesting novel research directions that combine the effectiveness of neural IR with the efficiency of traditional IR models.

Fig. 2.
In-domain effectiveness for document retrieval using strategies BERT Random Passage (RandP) and BERT Characteristic Terms (CT), estimated using TF-IDF and PLM, evaluated in-domain on TREC's 2020 document retrieval task on MS MARCO, for a varying number of characteristic terms.

Fig. 3.
Efficiency (left: max. GPU memory; right: query latency) for different input lengths vs. effectiveness in NDCG@10, for our proposed method CT and other document models on TREC's 2020 document retrieval task on MS MARCO.

Fig. 4.
Out-of-domain effectiveness for document retrieval using strategies BERT Random Passage (RandP) and BERT Characteristic Terms (CT), for a varying number of input tokens on Robust04.

Fig. 5.
In-domain effectiveness for document retrieval using strategies BERT First Passage (FirstP), Random Passage (RandP), and BERT Characteristic Terms (CT), estimated using TF-IDF and PLM, evaluated in-domain on TREC's 2020 document retrieval task on MS MARCO, for a varying number of characteristic terms.

Fig. 6.
Out-of-domain effectiveness for document retrieval using strategies BERT First Passage (FirstP), Random Passage (RandP), and BERT Characteristic Terms (CT), for a varying number of input tokens on Robust04.

Table 1.
Document (473 Terms) about Combine Harvesters Redacted to 64 Characteristic Terms. Doc-id: D1960440 in the MS MARCO V1 collection.

Table 4.
Examples of the Same Documents Reduced to 32 Input Tokens using TF-IDF and PLM, for the Query "what Amino Produces Carnitine"

Table 5.
In-Domain Performance of Different Cross-Encoder Models and our Proposed Method BERT CT for a Different Number of Input Tokens and TF-IDF and PLM for Selecting Characteristic Terms for TREC's 2020 DL Document Ranking Task. Statistical significance w.r.t. baselines 1-8 using a two-tailed Student t-test with p < 0.05, where the direction of the triangle indicates improvement or deterioration. Max. input tokens refers, for models 2-6, to the total tokens seen; for models 7-8, to the maximum number of consecutive tokens seen; and for BERT CT, to tokens seen out of all tokens.

Table 6.
Efficiency at Reduced Input: Query Latency and GPU Memory for Fixed Batch Size 32 for the BERT Model

Table 7.
Efficiency at Reduced Input: Query Latency with Maximum Batch Size Fitting on GPU

Table 8.
Efficiency: GPU Memory in GB and Query Latency in ms on the Document Retrieval Task on MS MARCO (TREC DL 2020). Comparison of the computational cost (max. GPU memory and query latency) with the effectiveness for our baselines and ranking on characteristic terms (CT).

Table 9.
Storage on Disk (in GB) of the MS MARCO Document Collection (V1) Reducing it to a Varying Number of Characteristic Terms

Table 10.
Out-of-Domain Performance of Different Cross-Encoder Models and our Proposed Method CT for a Different Number of Input Tokens and TF-IDF and PLM for Selecting Characteristic Terms for Document Ranking on Robust04. Statistical significance w.r.t. baselines 1-8 using a two-tailed Student t-test with p < 0.05, where the direction of the triangle indicates improvement or deterioration. Max. input tokens refers, for models 2-6, to the total tokens seen; for models 7-8, to the maximum number of consecutive tokens seen; and for BERT CT, to tokens seen out of all tokens.

Table 11.
Efficiency vs. Effectiveness (NDCG@10) of Dual-Encoder SPLADE on the Basis of MaxP and Characteristic Terms (CT) for TREC 2020 Doc. Statistical significance w.r.t. baselines 1-2 using a two-tailed Student t-test with p < 0.05, where the direction of the triangle indicates improvement or deterioration. Input tokens refers, for MaxP, to the total tokens seen, and for CT, to tokens seen out of all tokens.

Table 12.
Performance of MaxP and Characteristic Terms (CT) Zero-Shot and Fine-Tuned on the TREC DL 2020 Doc Task on MS MARCO. Statistical significance w.r.t. baselines 1-8 using a two-tailed Student t-test with p < 0.05, where the direction of the triangle indicates improvement or deterioration. Max. input tokens refers, for models 2-6, to the total tokens seen, and for the other models, to tokens seen out of all tokens.