Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders

Dual-encoder-based dense retrieval models have become the standard in IR. They employ large Transformer-based language models, which are notoriously inefficient in terms of resources and latency. We propose Fast-Forward indexes—vector forward indexes which exploit the semantic matching capabilities of dual-encoder models for efficient and effective re-ranking. Our framework enables re-ranking at very high retrieval depths and combines the merits of both lexical and semantic matching via score interpolation. Furthermore, in order to mitigate the limitations of dual-encoders, we tackle two main challenges: Firstly, we improve computational efficiency by either pre-computing representations, avoiding unnecessary computations altogether, or reducing the complexity of encoders. This allows us to considerably improve ranking efficiency and latency. Secondly, we optimize the memory footprint and maintenance cost of indexes; we propose two complementary techniques to reduce the index size and show that, by dynamically dropping irrelevant document tokens, the index maintenance efficiency can be improved substantially. We perform an evaluation to show the effectiveness and efficiency of Fast-Forward indexes—our method has low latency and achieves competitive results without the need for hardware acceleration, such as GPUs.


INTRODUCTION
Neural rankers are typically based on large pre-trained language models, the most popular example being BERT [14]. Due to their architectural inductive bias (like self-attention units) and complexity, these models are able to capture the semantics of documents very well, mitigating the limitations of lexical retrievers. However, their capabilities come at a price, as the models commonly used often have upwards of hundreds of millions of parameters. This makes training and even inference without specialized hardware infeasible, and it is impossible to rank all documents in a large corpus in reasonable time. Furthermore, the resources required to run these models produce a considerable amount of emissions, creating a negative impact on the environment [74].
There are two predominant approaches to deal with the inefficiency of neural ranking models. The first one, referred to as retrieve-and-re-rank [25,75], uses an efficient lexical retriever to obtain a candidate set of documents for the given query. The idea is to maximize the recall, i.e., capture most of the relevant documents, in the first stage. Afterwards, the second stage employs a complex neural ranker, which re-ranks the documents in the candidate set, in order to promote the relevant documents to higher ranks. However, the retrieve-and-re-rank approach typically employs cross-attention re-rankers, which are expensive to compute even for a small set of candidate documents. This limits the first-stage retrieval depth, as low latency is essential for many applications (e.g., search engines).
The second approach skips the lexical retrieval step entirely and uses neural models for retrieval. The dual-encoder architecture employs a query encoder and a document encoder, both of which are neural models which map their string inputs to dense representations in a common vector space. Retrieval is then performed as a k-nearest-neighbor (kNN) search operation to find the documents whose representations are most similar to the query. This is referred to as dense retrieval [34]. Representing queries and documents independently means that most of the computationally expensive processing happens during the indexing stage, where document representations are pre-computed. However, dense retrieval is still slower than lexical retrieval and benefits from GPU acceleration, because the query needs to be encoded during the query-processing phase. Furthermore, we find that dense retrievers generally have lower recall than term-matching-based models at higher retrieval depths.
In this paper, we argue that neither of the two approaches is ideal. Instead, our first key idea is to explore the utility of dual-encoders in the re-ranking phase instead of the retrieval phase. Using dual-encoders in the re-ranking phase allows for a drastic reduction of query processing times and resource utilization (i.e., GPUs) during document encoding. Towards this, we first show that simple interpolation-based re-ranking that combines the benefits of lexical (computed using sparse retrieval) and semantic (computed using dual-encoders) similarity can result in competitive and sometimes better performance than using cross-attention. We propose a novel index structure called Fast-Forward indexes, which exploits the ability of dual-encoders to pre-compute document representations, in order to substantially improve the runtime efficiency of re-ranking. We empirically establish that dual-encoder models show great performance as re-rankers, even though they do not use cross-attention.
Our second observation is that most current dual-encoder models use the same encoder for both documents and queries. While this design decision makes training easier, it also means that queries have to be encoded during runtime using a, potentially expensive, forward pass. We argue that this is suboptimal; queries, which are often short and concise, do not require a complex encoder to compute their representations. To that end, we propose two families of lightweight query encoders, some of which do not contain any self-attention layers, and show that they still perform well as re-rankers, while requiring only a fraction of the resources and time, drastically reducing query-encoding costs without compromising ranking performance.
Lastly, we focus on the aspects of index footprint and index maintenance. Since dense indexes store the pre-computed representations of documents in the corpus, they exhibit much higher storage and memory requirements compared to sparse indexes [30]. At the same time, maintaining the index, i.e., adding new documents, requires expensive forward passes of the document encoder. We propose two means of reducing the memory footprint: On the one hand, we propose sequential coalescing to compress an index by reducing the number of vectors that need to be stored; on the other hand, we experiment with choosing a smaller number of dimensions, which reduces the size of each vector. Finally, we propose efficient document encoders, which dynamically drop irrelevant tokens prior to indexing using a very simple technique.
Our research questions are as follows:

RQ1 How suitable are dual-encoder models for interpolation-based re-ranking in terms of performance and efficiency?

RQ2 Can the re-ranking efficiency be improved by limiting the number of Fast-Forward look-ups?

RELATED WORK
Classical ranking approaches, such as BM25 [70] or the query likelihood model [41], rely on the inverted index that stores term-level statistics like term frequency, inverse document frequency, and positional information. We refer to this style of methods as sparse, since it assumes sparse document representations. The recent success of large pre-trained language models (e.g., BERT) shows that semantic or contextualized information is essential for many language tasks. In order to incorporate such information in the relevance measurement, Dai and Callan [12,13] proposed DeepCT, which stores contextualized scores for terms in the inverted index for text ranking. SPLADE [18] aims to enrich sparse document representations using a trained contextual Transformer model and sparsity regularization on the term weights. Similarly, DeepImpact [61] enriches the document collection with expansion terms to learn improved term impacts. In our work, we employ efficient sparse models for high-recall first-stage retrieval and perform re-ranking using semantic models in a subsequent step.

The ability to accurately determine semantic similarity is essential in order to alleviate the vocabulary mismatch problem [11,13,57,59,64]. Computing the semantic similarity of a document given a query has been heavily researched in IR using smoothing methods [37], topic models [84], embeddings [63], personalized models [56], etc. In these classical approaches, ranking is performed by interpolating the semantic similarity scores with the lexical matching scores from the first-stage retrieval. More recently, dense neural ranking methods, which employ large pre-trained language models, have become increasingly popular. Dense rankers do not explicitly model terms, but rather compute low-dimensional dense vector representations through self-attention mechanisms in order to estimate relevance; this allows them to perform semantic matching. However, the inherent complexity of dense ranking models usually has a negative impact on latency and cost, especially with large corpora. Therefore, besides performance, efficiency has been another major concern in developing neural ranking models.
There are two common architectures of dense ranking models: Cross-attention models take a concatenation of a query and a document as input. This allows them to perform query-document attention in order to compute the corresponding relevance score. These models are typically used as re-rankers. Dual-encoder models employ two language models to independently encode queries and documents as fixed-size vector representations. Usually, a similarity metric between query and document vector determines their relevance. As a result, dual-encoders are mostly used for dense retrieval, but also, less commonly, for re-ranking.
We divide the remainder of the related work section into subcategories for cross-attention models, dual-encoder models, and hybrid models, which employ both lexical and semantic rankers.Finally, we briefly cover inference efficiency for BERT-based models.

Cross-Attention Models
The majority of cross-attention approaches have been dominated by large contextual models [1,10,27,29,45,58]. The input to these ranking models is a concatenation of the query and document. This combined input results in higher query processing times, since each document has to be processed in conjunction with the query string. Therefore, cross-attention models usually re-rank a relatively small number of potentially relevant candidates retrieved in the first stage by efficient sparse methods. The expensive re-ranking cost is then proportional to the retrieval depth (e.g., 1000 documents).
Another key limitation of using cross-attention models for document ranking is the maximum acceptable number of input tokens for Transformer models, which exhibit quadratic complexity w.r.t. input length. Some strategies address this limitation by document truncation [58] or by chunking documents into passages [10,72]. However, the performance of chunking-based strategies depends on the chunking properties, i.e., passage length or overlap among consecutive passages [73]. Recent proposals include a two-stage approach, where a query-specific summary is generated by selecting relevant parts of the document, followed by re-ranking strategies over the query and summarized document [28,43,46,48]. Due to these efficiency concerns, we do not consider cross-attention methods in our work, but focus on dual-encoders instead.

Dual-Encoders
Dual-encoders learn dense vector representations for queries and documents using contextual models [34,35]. The dense vectors are then indexed in an offline phase [32], where retrieval is akin to performing an approximate nearest neighbor (ANN) search given a vectorized query. This allows dual-encoders to be used for both retrieval and re-ranking. Consequently, there has been a large number of follow-up works that boost the performance of dual-encoder models by improving pre-training [5,20,21,39,82], optimization [23], and negative sampling [68,86,88] techniques, or employing distillation approaches [51,54,90]. Lindgren et al. [53] propose a negative cache that allows for efficient training of dual-encoder models. LED [89] uses a SPLADE model to enrich a dense encoder with lexical information. Lin et al. [50] propose Aggretriever, a dual-encoder model which aggregates and exploits all token representations (instead of only the classification token). Some approaches have also proposed architectural modifications to the aggregations between the query and passage embeddings [6,27,31]. Nogueira et al. [67] propose a simple document expansion model. In our work, we use dual-encoder models to perform efficient semantic re-ranking.
Efficiency improvements of dual-encoder-based ranking and retrieval focus mostly on either inference efficiency of the encoders or memory footprint of the indexes. TILDE [92] and TILDEv2 [91] efficiently re-rank documents using a deep query and document likelihood model instead of a query encoder. The SpaDE model [7] employs a dual document encoder that has a term weighting and a term expansion component; it improves inference efficiency by using a vastly simplified query representation. Li et al. [47] employ dynamic lexical routing in order to reduce the number of dot products in the late interaction step. Cohen et al. [8] use auto-encoders to compress document representations into fewer dimensions in order to reduce the overall size. Dong et al. [15] propose an approach to split documents into variable-length segments and dynamically merge them based on similarity, such that each document has the same number of segments prior to indexing. Hofstätter et al. [26] introduce ColBERTer, an extension of ColBERT [35], which removes irrelevant word representations in order to reduce the number of stored vectors. In a similar fashion, Lassance et al. [40] propose a learned token pruning approach, which is also used to reduce the size of ColBERT indexes by dropping tokens that are deemed irrelevant. Yang et al. [87] propose a contextual quantization approach for pre-computed document representations (such as the ones used by ColBERT) by compressing document-specific representations of terms.
In most of the previous work, dual-encoders are used in a homogeneous or symmetric fashion, meaning that both the query and document encoder have the same architecture or even share weights (Siamese encoders). Jung et al. [33] show that the characteristics of queries and documents are different and employ light fine-tuning in order to adapt each encoder to its specific role. Kim et al. [36] use model distillation for asymmetric dual-encoders, where the query encoder has fewer parameters than the document encoder. Lassance and Clinchant [38] separate the query and document encoder of SPLADE models in order to improve efficiency. In this work, we explore the use of lightweight query encoders for more efficient re-ranking.

Hybrid Models
Hybrid models combine sparse and dense retrieval. The most common approach is a simple linear combination of both scores [51]. CLEAR [23] takes the relevance of the lexical retriever into account in the loss function of the dense retriever. COIL [22] performs contextualized exact matching using pre-computed document token representations. COILcr [17] extends this approach by factorizing token representations and approximating them using canonical representations in order to make retrieval more efficient.
Unlike classical methods, where score interpolation is the norm, semantic similarity from neural contextual models (e.g., cross-attention or dual-encoder models) is not consistently combined with the lexical matching score. Recently, Wang et al. [83] showed that the interpolation of BERT-based models and lexical retrieval methods can boost performance. Furthermore, they analyze the role of interpolation in BERT-based dense retrieval strategies and find that dense retrieval alone is not enough, but interpolation with BM25 scores is necessary. Similarly, Askari et al. [2] find that even providing the BM25 score as part of the input text improves the re-ranking performance of BERT models.

Inference Efficiency
Several methods have been proposed to improve the inference efficiency of large Transformer-based models, which have quadratic time complexity w.r.t. the input length. PoWER-BERT [24] progressively eliminates word vectors in the subsequent encoder layers in order to reduce the input size. DeeBERT [85] implements an early-exit mechanism, which may stop the computation after any Transformer layer based on the entropy of its output distribution. SkipBERT [81] uses a technique where intermediate Transformer layers can be skipped dynamically using pre-computed look-up tables. We use a simple Selective BERT approach, which dynamically removes irrelevant document tokens in order to make document encoding more efficient.

PRELIMINARIES
In this section, we introduce core concepts that are essential to this work, such as retrieval, re-ranking, and interpolation.

Interpolation-Based Re-Ranking
The retrieval of documents or passages given a query often happens in two stages [75]: In the first stage, a term-frequency-based (sparse) retrieval method (such as BM25 [71]) retrieves a set of documents from a large corpus. In the second stage, another model, which is usually much more computationally expensive, re-ranks the retrieved documents.
In sparse retrieval, we denote the top-k_S documents retrieved from the sparse index for a query q as D_q^S. The sparse score of a query-document pair (q, d) is denoted by φ_S(q, d). For the re-ranking part, we focus on self-attention models (such as BERT [14]) in this work. These models operate by creating (internal) high-dimensional dense representations of queries and documents, focusing on their semantic structure. We refer to the outputs of these models as dense or semantic scores and denote them by φ_D(q, d). Due to the quadratic time complexity of self-attention w.r.t. the document length (and decreasing performance with increasing document length [55]), long documents are often split into passages, and the score of a document d is then computed as the maximum of its passage scores, i.e., φ_D(q, d) = max_{p ∈ d} φ_D(q, p). This approach is referred to as maxP [10].
The retrieval approach for a query q starts by retrieving D_q^S from the sparse index. For each retrieved document d ∈ D_q^S, the corresponding dense score φ_D(q, d) is computed. This dense score may then be used to re-rank the retrieved set to obtain the final ranking. However, it has been shown that the scores of the sparse retriever, φ_S, can be beneficial for re-ranking as well [1]. To that end, an interpolation approach is employed [4], where the final score of a query-document pair is computed as

φ(q, d) = α · φ_S(q, d) + (1 − α) · φ_D(q, d).

Setting α = 0 recovers the standard re-ranking procedure.
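As an illustration, interpolation-based re-ranking amounts to a weighted sum over the candidate set; the following minimal sketch uses made-up scores and function names (they are not the paper's implementation):

```python
# Illustrative sketch of interpolation-based re-ranking.
# `sparse_scores` and `dense_score` are hypothetical names for this example.

def interpolate_rerank(sparse_scores, dense_score, alpha):
    """Re-rank by alpha * phi_S(q, d) + (1 - alpha) * phi_D(q, d).

    sparse_scores: dict mapping document id -> sparse (e.g., BM25) score
    dense_score:   callable doc_id -> semantic score
    alpha:         interpolation weight; alpha = 0 is plain re-ranking
    """
    scores = {
        doc: alpha * s + (1 - alpha) * dense_score(doc)
        for doc, s in sparse_scores.items()
    }
    # Sort candidates by interpolated score, best first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy example with made-up scores:
candidates = {"d1": 10.0, "d2": 8.0, "d3": 6.0}
dense = {"d1": 0.1, "d2": 0.9, "d3": 0.8}.get
ranking = interpolate_rerank(candidates, dense, alpha=0.5)
```

Note how the choice of α shifts the ranking: with α = 0.5 the strong sparse score of d1 dominates, while α = 0 ranks purely by the dense scores.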
Since the set of documents retrieved by the sparse model is typically large (e.g., k_S = 1000), computing the dense score for each query-document pair can be very computationally expensive. In this paper, we focus on efficient implementations of interpolation-based re-ranking, specifically the computation of the dense scores φ_D.

Dual-Encoder Models
The dual-encoder architecture [34] employs neural semantic models to compute dense vector representations of queries and documents. Specifically, a query encoder ζ and a document encoder η map queries and documents to representations in a common a-dimensional vector space. The relevance score φ_D(q, d) of a query-document pair is then computed as the similarity of their vector representations. A common choice for the similarity function is the dot product, such that

φ_D(q, d) = ζ(q) · η(d),

where ζ(q), η(d) ∈ ℝ^a.

Dense Retrieval. Dual-encoder models are commonly utilized to perform dense retrieval [34]. A dense index contains pre-computed vector representations η(d) for all documents d in the corpus D. To retrieve a set of documents D_q^D for a query q, a k-nearest-neighbor (kNN) search is performed to find the documents whose representations are most similar to the query, i.e., D_q^D contains the k_D documents d ∈ D with the highest similarity ζ(q) · η(d). In order to make dense retrieval more efficient, approximate nearest neighbor (ANN) search is commonly employed [32,60]. ANN search can be further accelerated using special hardware, such as GPUs [32].
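A minimal brute-force version of this kNN search can be sketched as follows; real systems replace the exhaustive loop with an ANN library such as FAISS, and the vectors and names here are purely illustrative:

```python
# Brute-force k-nearest-neighbor dense retrieval over a toy index.

def dot(u, v):
    # Dot-product similarity between two vectors.
    return sum(a * b for a, b in zip(u, v))

def dense_retrieve(query_vec, index, k):
    """index: dict doc_id -> document vector eta(d); returns top-k by dot product."""
    scored = sorted(index.items(), key=lambda item: dot(query_vec, item[1]),
                    reverse=True)
    return [(doc, dot(query_vec, vec)) for doc, vec in scored[:k]]

# Toy index with pre-computed (made-up) document vectors:
index = {"d1": [1.0, 0.0], "d2": [0.5, 0.5], "d3": [0.0, 1.0]}
top = dense_retrieve([1.0, 0.2], index, k=2)
```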

Training. In contrast to cross-encoder models, which are often used for re-ranking (cf. Section 3.1), dual-encoders encode the query and document independently, i.e., there is no query-document attention. Typically, dual-encoders for retrieval are trained using a contrastive loss function [34], where a training instance consists of a query q, a positive (relevant) document d⁺, and a set D⁻ of negative (irrelevant) documents:

L = − log ( exp(φ_D(q, d⁺) / τ) / ( exp(φ_D(q, d⁺) / τ) + Σ_{d⁻ ∈ D⁻} exp(φ_D(q, d⁻) / τ) ) ).

The temperature τ is a hyperparameter. Since it is usually infeasible to include all negative documents for a query in D⁻, there are various negative sampling approaches, such as distillation [51], asynchronous indexes [86], or negative caches [53]. In this work, we use a simple in-batch strategy [34], where, for a query q, D⁻ contains a number of hard negatives (retrieved by BM25) along with all documents from the other queries in the same training batch.
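A minimal sketch of this contrastive loss, assuming dot-product similarity and plain Python in place of a tensor library (a real training loop would use batched tensors and autograd):

```python
import math

# Contrastive (softmax cross-entropy) loss for one training instance:
# a query vector, one positive document vector, and a set of negatives.

def contrastive_loss(q, d_pos, d_negs, tau=1.0):
    """-log softmax of the positive pair among positive + negative documents."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    pos = math.exp(dot(q, d_pos) / tau)      # exp(phi_D(q, d+) / tau)
    neg = sum(math.exp(dot(q, d) / tau) for d in d_negs)
    return -math.log(pos / (pos + neg))

# Toy vectors: the positive document aligns with the query, the negative does not.
loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```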

Hybrid Retrieval
Hybrid retrieval [23,51] is similar to interpolation-based re-ranking (cf. Section 3.1). The key difference is that the dense scores φ_D(q, d) are not computed for all query-document pairs. Instead, φ_D is a dense retrieval model (cf. Section 3.2.1), which retrieves documents D_q^D and their scores φ_D(q, d) using nearest neighbor search given a query q. A hybrid retriever combines the retrieved sets of a sparse and a dense retriever. For a query q, we retrieve two sets of documents, D_q^S and D_q^D, using the sparse and dense retriever, respectively. Note that the two retrieved sets are usually not equal. One strategy proposed in [51] ranks all documents in D_q^S ∪ D_q^D, approximating missing scores. In our experiments, however, we found that only considering documents from D_q^S ∩ D_q^D for the final ranking and discarding the rest works well. The final score is thus computed as

φ(q, d) = α · φ_S(q, d) + (1 − α) · φ_D(q, d),  d ∈ D_q^S ∩ D_q^D.

The re-ranking step in hybrid retrieval is essentially a sorting operation over the interpolated scores and takes negligible time in comparison to standard re-ranking.
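As a toy illustration with made-up scores, and assuming the intersection-based merge described above, the hybrid score combination might look like:

```python
# Hybrid retrieval merge: interpolate scores only for documents retrieved by
# BOTH the sparse and the dense retriever; scores are made up for the example.

def hybrid_scores(sparse, dense, alpha):
    """sparse, dense: dict doc_id -> score; interpolate over the intersection."""
    common = sparse.keys() & dense.keys()
    return {d: alpha * sparse[d] + (1 - alpha) * dense[d] for d in common}
```

Documents retrieved by only one of the two models (and thus missing one score) are simply discarded instead of having their missing score approximated.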

FAST-FORWARD INDEXES
The hybrid approach described in Section 3.3 has two distinct disadvantages. Firstly, in order to retrieve D_q^D, an (approximate) nearest neighbor search has to be performed, which is time consuming. Secondly, some of the query-document scores are expected to be missing, leading to an incomplete interpolation, where the score of one of the retrievers needs to be approximated [52] for a number of query-document pairs.

Fig. 2. Early stopping reduces the number of interpolation steps by computing an approximate upper bound for the dense scores. This example depicts the most extreme case, where only the top-1 document is required.
In this section, we propose Fast-Forward indexes as an efficient way of computing dense scores for known documents that alleviates the aforementioned issues. Specifically, Fast-Forward indexes build upon dual-encoder dense retrieval models that compute the score of a query-document pair as a dot product

φ_D(q, d) = ζ(q) · η(d),

where ζ and η are the query and document encoders, respectively. Examples of such models are ANCE [86] and TCT-ColBERT [52]. Since the query and document representations are independent for two-tower models, we can pre-compute the document representations η(d) for each document d in the corpus. These document representations are then stored in an efficient hash map, allowing for look-ups in constant time. After the index is created, the score of a query-document pair can be computed as

φ_D(q, d) = ζ(q) · η^F(d),

where the superscript F indicates the look-up of a pre-computed document representation in the Fast-Forward index. At retrieval time, only ζ(q) needs to be computed once for each query. As queries are usually short, this can be done on CPUs. The main benefit of this method is that the number of documents to be re-ranked can be much higher than with cross-attention models; the scoring operation is a simple look-up and dot product computation.

Algorithm 1: Compression of dense maxP indexes by sequential coalescing. Input: list of passage vectors of a document (in original order), distance threshold δ.
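A Fast-Forward index thus amounts to little more than a hash map of pre-computed document vectors; the following sketch is purely illustrative (the class and method names are ours, not the actual implementation):

```python
# Minimal sketch of a Fast-Forward index: pre-computed document vectors in a
# hash map, scored at query time by a single look-up and dot product.

class FastForwardIndex:
    def __init__(self):
        self._vectors = {}  # doc_id -> pre-computed eta(d)

    def add(self, doc_id, vector):
        # Indexing phase (offline): store the document representation.
        self._vectors[doc_id] = vector

    def score(self, query_vec, doc_id):
        # Re-ranking phase: constant-time look-up plus a dot product.
        vec = self._vectors[doc_id]
        return sum(a * b for a, b in zip(query_vec, vec))
```

The query vector ζ(q) is computed once per query; every candidate document then costs only one hash-map look-up and one dot product, which is what makes re-ranking at high retrieval depths affordable.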
Note that the use of large Transformer-based query encoders still remains a bottleneck in terms of latency (or, if they are run on GPUs, cost). In Section 5, we focus on lightweight encoder models.

Index Compression via Sequential Coalescing
A major disadvantage of dense indexes and dense retrieval in general is the size of the final index. This is caused by two factors: Firstly, in contrast to sparse indexes, the dense representations cannot be stored as efficiently as sparse vectors. Secondly, the dense encoders are typically Transformer-based, imposing a (soft) limit on their input lengths due to their quadratic time complexity with respect to the inputs. Thus, long documents are split into passages prior to indexing (maxP indexes).
As an increase in the index size has a negative effect on efficiency, both for nearest neighbor search and Fast-Forward indexing as used by our approach, we exploit a sequential coalescing approach as a way of dynamically combining the representations of consecutive passages within a single document in maxP indexes. The idea is to reduce the number of passage representations in the index for a single document. This is achieved by exploiting the topical locality that is inherent to documents [42]. For example, a single document might contain information regarding multiple topics; due to the way human readers naturally ingest information, we expect documents to be authored such that a single topic appears mostly in consecutive passages, rather than spread throughout the whole document. Our approach aims to combine consecutive passage representations that encode similar information. To that end, we employ the cosine distance function and a threshold parameter δ that controls the degree of coalescing. Within a single document, we iterate over its passage vectors in their original order and maintain a set A, which contains the representations of the already processed passages, and continuously compute Ā as the average of all vectors in A. For each new passage vector v, we compute its cosine distance to Ā. If it exceeds the distance threshold δ, the current passages in A are combined as their average representation Ā. Afterwards, the combined passages are removed from A and Ā is recomputed. This approach is illustrated in Algorithm 1. Fig. 1 shows an example index after coalescing. To the best of our knowledge, there are no other forward index compression techniques proposed in literature so far.
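The sequential coalescing procedure described above can be sketched in Python as follows; the function names are ours, and the toy vectors are illustrative:

```python
import math

# Sequential coalescing sketch: consecutive passage vectors of one document are
# averaged as long as they stay within cosine distance delta of the running
# average; crossing the threshold starts a new cluster.

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def average(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def coalesce(passage_vectors, delta):
    coalesced, A = [], []
    for v in passage_vectors:  # iterate in original passage order
        if A and cosine_distance(v, average(A)) > delta:
            coalesced.append(average(A))  # emit the current cluster
            A = []
        A.append(v)
    if A:
        coalesced.append(average(A))      # emit the final cluster
    return coalesced
```

With a small δ, almost every passage keeps its own vector; with a large δ, a whole document collapses into a single averaged representation.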

Faster Interpolation by Early Stopping
As described in Section 3.1, by interpolating the scores of sparse and dense retrieval models, we perform implicit re-ranking, where the dense representations are pre-computed and can be looked up in a Fast-Forward index at retrieval time. Furthermore, increasing the sparse retrieval depth k_S, such that k_S > k, where k is the final number of documents, improves the performance. A drawback of this is that an increase in the number of retrieved documents also results in an increase in the number of index look-ups.
Common term pruning mechanisms for term-at-a-time retrieval, such as MaxScore [79] or WAND [3], accelerate query processing for inverted-index-based retrievers; however, these techniques are not compatible with neural ranking models based on contextual query and document representations.Our use case is more similar to top-k query evaluation, with algorithms such as the threshold algorithm [16] or probabilistic approximations [77], but these approaches usually require sorted access, which is not available for the dense re-ranking scores in our case.
In this section, we propose an extension to Fast-Forward indexes that allows for early stopping, i.e., avoiding a number of unnecessary look-ups, for cases where k_S > k, by approximating the maximum possible dense score. The early stopping approach takes advantage of the fact that documents are ordered by their sparse scores φ_S(q, d). Since the number of retrieved documents, k_S, is finite, there exists an upper limit s_max for the corresponding dense scores such that φ_D(q, d) ≤ s_max ∀ d ∈ D_q^S. Since the retrieved documents D_q^S are ordered by their sparse scores, we can simultaneously perform interpolation and re-ranking by iterating over the ordered list of documents: Let d_i be the i-th highest ranked document by the sparse retriever. Recall that we compute the final score as

φ(q, d_i) = α · φ_S(q, d_i) + (1 − α) · φ_D(q, d_i).

If i > k, we can compute an upper bound for φ(q, d_i) by exploiting the aforementioned ordering:

φ(q, d_i) ≤ s_best = α · φ_S(q, d_i) + (1 − α) · s_max.

In turn, this allows us to stop the interpolation and re-ranking if s_best ≤ s_k, where s_k denotes the score of the k-th document in the current ranking (i.e., the currently lowest ranked document). Intuitively, this means that we stop the computation once the highest possible interpolated score φ(q, d_i) is too low to make a difference. The approach is illustrated in Algorithm 2 and Fig. 2. Since the dense scores φ_D are usually unnormalized, the upper limit s_max is unknown in practice. We thus approximate it by using the highest observed dense score at any given step.
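The early stopping procedure can be sketched as follows; this is a simplified rendering of the idea behind Algorithm 2 with illustrative names, using the highest dense score observed so far as the approximation of s_max:

```python
import heapq

# Interpolation with early stopping: candidates arrive sorted by sparse score
# (descending); we stop once even the best possible remaining interpolated
# score cannot enter the current top-k.

def rerank_early_stop(candidates, dense_score, k, alpha):
    """candidates: list of (doc_id, sparse_score), sorted by sparse score desc.

    Returns the (approximate) top-k as (score, doc_id) pairs, best first.
    """
    heap = []                  # min-heap holding the current top-k scores
    s_max = float("-inf")      # running approximation of the dense upper bound
    for doc, s_sparse in candidates:
        if len(heap) == k:
            best_possible = alpha * s_sparse + (1 - alpha) * s_max
            if best_possible <= heap[0][0]:
                break          # no remaining document can improve the top-k
        s_dense = dense_score(doc)          # Fast-Forward index look-up
        s_max = max(s_max, s_dense)
        score = alpha * s_sparse + (1 - alpha) * s_dense
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True)
```

Because s_max is only the sample maximum of the scores seen so far, the result is an approximation of the exact top-k, which motivates the theoretical analysis below.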

Theoretical Analysis.
We first show that the early stopping criterion, when using the true maximum of the dense scores, is sufficient to obtain the top-k scores.

Algorithm 2: Interpolation with early stopping
Input: query q, sparse retrieval depth k_S, cut-off depth k, interpolation parameter α. Output: approximated top-k scores.

Proof. First, note that the sparse scores, φ_S(q, d_i), are already sorted in decreasing order for a given query. By construction, the priority queue Q always contains the k highest scores corresponding to the list parsed so far. Let, after parsing j scores, Q be full. Now the possible best score s_best is computed using the sparse score found next in the decreasing sequence and the maximum of all dense scores, s_max (cf. Line 7). If s_best is less than the minimum of the scores in Q, then Q already contains the top-k scores. To see this, note that the first term of s_best is the largest among all unseen sparse scores (as the list is sorted) and s_max is the maximum of the dense scores by our assumption. □

Next, we show that a good approximation of the top-k scores can be achieved by using the sample maximum. To prove our claim, we use the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [62].

Lemma 4.2. Let X_1, X_2, ..., X_n be n real-valued independent and identically distributed random variables with the cumulative distribution function F(·). Let F_n(·) denote the empirical cumulative distribution function, i.e.,

F_n(x) = (1/n) · |{i : X_i ≤ x}|.

According to the DKW inequality, the following estimate holds for every ε ≥ √((1/(2n)) · ln 2):

Pr( sup_x ( F_n(x) − F(x) ) > ε ) ≤ e^{−2nε²}.

In the following, we show that, if s_max is chosen as the maximum of a large random sample drawn from the set of dense scores, then the probability that any given dense score, chosen independently and uniformly at random from the dense scores, is greater than s_max is exponentially small in the sample size.
Theorem 4.3. Let X_1, X_2, ..., X_n be a real-valued independent and identically distributed random sample drawn from the distribution of the dense scores with the cumulative distribution function F(·), and let s_max = max_i X_i. Then, for every ε ≥ √((1/(2n)) · ln 2), we obtain

Pr( F(s_max) < 1 − ε ) ≤ e^{−2nε²}.   (4)

Proof. Let F_n(·) denote the empirical cumulative distribution function as above. Specifically, F_n(x) is equal to the fraction of variables less than or equal to x. We then have F_n(s_max) = 1. By Lemma 4.2, we infer Pr( F_n(s_max) − F(s_max) > ε ) ≤ e^{−2nε²}. Substituting F_n(s_max) = 1, we obtain Equation (4). □

This implies that the probability of any random variable X, chosen randomly from the set of dense scores, being less than or equal to s_max, is greater than or equal to 1 − ε with high probability, i.e., Pr_X(X ≤ s_max) ≥ 1 − ε, where Pr_X denotes the probability distribution of the dense scores. This means that, as our sample size grows until it reaches k_S, the approximation improves. Note that, in our case, the dense scores are sorted (by their corresponding sparse scores) and thus the i.i.d. assumption cannot be ensured. However, we observed that the dense scores are positively correlated with the sparse scores. We argue that, due to this correlation, we can approximate the maximum score well.
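The practical effect of using the sample maximum can be checked with a small simulation of our own (not taken from the paper): the probability that a freshly drawn score exceeds the maximum of an n-element sample shrinks quickly as n grows.

```python
import random

# Monte Carlo sanity check: draw n i.i.d. scores, take their maximum s_max,
# then see how often one more independent draw exceeds s_max. For i.i.d.
# samples this miss rate is 1 / (n + 1) in expectation.

def miss_rate(n, trials=2000, seed=0):
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        s_max = max(rng.gauss(0, 1) for _ in range(n))
        if rng.gauss(0, 1) > s_max:  # new score exceeds the sample maximum
            misses += 1
    return misses / trials
```

Larger samples make the sample maximum a tighter stand-in for the true upper limit, mirroring the statement of Theorem 4.3.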

EFFICIENT ENCODERS
BERT models are the de facto standard for both query and document encoders [34, 51, 86]. The encoders are often homogeneous, meaning that the architectures of both models are identical, or even Siamese, i.e., the same encoder weights are used for both queries and documents. Other approaches are semi-Siamese models [33], where light fine-tuning is used to adapt each encoder to its input characteristics, or TILDE [92] and TILDEv2 [91], which do not require dense query representations. However, the most common choice remains the use of BERT base for both encoders.
In this paper, we argue that the homogeneous structure is not ideal for dual-encoder IR models w.r.t. query processing efficiency, since the characteristics of queries and documents differ [33]. We illustrate those characteristics w.r.t. the average number of tokens in Fig. 3. This section focuses on model architectures for both query and document encoding that aim to improve the overall efficiency of the ranking process.

Lightweight Query Encoders
Query encoders need to be run online during query processing, i.e., the representations cannot be pre-computed. Consequently, query encoding latency is essential for many downstream applications, such as search engines. Our experiments reveal that even encoding a large batch of 256 queries using a BERT base model on CPU takes more than 3 seconds (cf. Fig. 7a), resulting in roughly 12 milliseconds per query (smaller batch sizes or even single queries lead to even slower encoding). Since queries are typically short and concise, we argue that query encoders require lower complexity (e.g., in terms of the number of parameters) than document encoders. Our proposed query encoders are considerably more lightweight than standard BERT base models, and thus more efficient in terms of latency and resources.

Attention-based.
Attention-based query encoders (such as models based on BERT [14]) use Transformer encoder layers [80] to compute query representations. Each of these layers has two main components, multi-head attention and a feed-forward sub-layer, both of which include residual connections and layer normalization operations.
Attention is computed based on three input matrices, the queries Q, keys K, and values V:

Attention(Q, K, V) = softmax(QK^T / √d_k) V.

Multi-head attention computes attention multiple times (using h attention heads head_i) and concatenates the results, as denoted by

MultiHead(Q, K, V) = concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

The matrices W_i^Q, W_i^K, W_i^V, and W^O are trainable parameters, d denotes the dimension of hidden representations in the model, and √d_k, with d_k = d/h, is a scaling factor. Since Transformer encoders compute self-attention, the three inputs Q, K, and V originate from the same place, i.e., they are projections of the output of the previous encoder layer. The inputs to the first encoder layer originate from a token embedding layer. We denote the embedding operation as ε: N → R^d, such that ε(t) is the embedding vector of a token t. The query representation is then computed as ζ(q) = BERT_CLS(q), where BERT_CLS indicates that the output vector corresponding to the classification token, denoted by [CLS], is used. Figure 4a shows attention-based query encoders.
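The scaled dot-product and multi-head attention operations can be sketched in NumPy as follows. This is a minimal illustration of the equations above, not the actual BERT implementation; for simplicity, the per-head projections W_i^Q, W_i^K, W_i^V are realized as column blocks of single full-width matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # project the layer input, attend per head, concatenate, project back
    d = X.shape[-1]
    d_h = d // h  # dimension per head (d_k)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [
        attention(Q[:, i * d_h:(i + 1) * d_h],
                  K[:, i * d_h:(i + 1) * d_h],
                  V[:, i * d_h:(i + 1) * d_h])
        for i in range(h)
    ]
    return np.concatenate(heads, axis=-1) @ W_o
```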
The usual choice for query encoders, BERT base, has L = 12 layers, d = 768 dimensions for hidden representations, and h = 12 attention heads. In this work, we investigate how less complex query encoders impact the re-ranking performance. Specifically, we vary three hyperparameters, namely the number of Transformer layers L, hidden dimensions d, and attention heads h. The pre-trained BERT models we use are provided by Turc et al. [78].

Embedding-based.
Embedding-based query encoders omit self-attention (and thus, contextualization) altogether; as a consequence, the usage of the [CLS] token is not feasible for this approach. Instead, a query q = (t_1, ..., t_|q|) is represented simply as the average of its token embeddings, i.e., ζ(q) = (1/|q|) Σ_{i=1}^{|q|} ε(t_i). Embedding-based query encoders are illustrated in Fig. 4b.
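An embedding-based query encoder is essentially a look-up table followed by an average. A minimal sketch under our assumptions (the class name is ours, and random initialization stands in for pre-trained BERT token embeddings):

```python
import numpy as np

class EmbeddingQueryEncoder:
    """Embedding-based query encoder: a pure look-up table, no forward pass."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.ids = {t: i for i, t in enumerate(vocab)}
        self.emb = rng.normal(size=(len(vocab), dim))  # token embedding table

    def encode(self, tokens):
        # zeta(q) = average of the embeddings of the query tokens
        rows = [self.ids[t] for t in tokens if t in self.ids]
        return self.emb[rows].mean(axis=0)
```

Since no self-attention is computed, encoding a query costs only a few table look-ups and one vector average.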

Selective Document Encoders
Document encoders are not run during query processing time, since document representations are pre-computed and indexed. However, the computation of document representations still requires a substantial amount of time and resources. This is particularly important for applications like web search, where index maintenance plays an important role, usually due to large amounts of new documents constantly needing to be added to the index. The effect is further amplified by the maxP approach (cf. Eq. (1)), where long documents require more than one encoding step. Since documents tend to be much longer and more complex than queries, lightweight document encoders would likely negatively affect performance, and recent research suggests that larger document encoders lead to better results [66]. However, due to the nature of documents obtained from web pages, we expect a considerable number of document tokens to be irrelevant for the encoding step; examples are stop words or redundant (repeated) information. Similar observations have been made in other approaches [26]. Furthermore, recent research [69] has shown that certain aspects, such as the position of tokens, are not essential for large language models to perform well.

Our proposed document encoders assign a relevance score to each input token and dynamically drop low-scoring tokens before computing self-attention in order to make the document encoding step more efficient. We refer to this approach as Selective BERT. It uses a scoring network Φ: N → [0, 1] to determine the relevance of each input token before feeding it into the encoding BERT model Ψ. We denote the parameters of the scoring network as θ_Φ and the parameters of the BERT model as θ_Ψ. We use a lightweight, non-contextual scoring network with three 384-dimensional feed-forward layers and ReLU activations. The final layer outputs a scalar that is fed into a sigmoid activation function to compute the final score. Selective BERT models are trained in two steps.

Fig. 5. The fine-tuning and inference phase of Selective BERT document encoders. In the given example, the documents in the input batch are dynamically shortened to four tokens each based on the corresponding relevance scores. Note that the positional encoding that is added to BERT input tokens has been omitted in this figure.

Pre-Training.
In the first step, the model is trained for a single epoch using the same data as during the unsupervised BERT pre-training step [14]. The scoring network Φ is taken into account by multiplying the embedding of an input token t_i by its corresponding score, i.e., x_i = ε(t_i) · Φ(t_i) + π(t_i), where ε(t_i) is the token embedding and π(t_i) is the positional encoding. The resulting representation x_i is then used to compute self-attention in the first encoder layer.

In order to encourage the scoring network to output scores less than one, we introduce a regularization term using the L1-norm over the scores, where n is the input sequence length:

L_Φ = (1/n) Σ_{i=1}^{n} |Φ(t_i)|.

The final objective is a combination of the original BERT pre-training loss L and the scoring regularizer scaled by a hyperparameter λ:

L_pre = L + λ L_Φ.
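The scoring network, the scaled token inputs, and the regularizer can be sketched as follows. This is a NumPy illustration with randomly initialized weights standing in for the trainable parameters θ_Φ; names such as TokenScorer are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TokenScorer:
    """Non-contextual scorer Phi: token embedding -> relevance score in [0, 1]."""

    def __init__(self, dim, hidden=384, seed=0):
        rng = np.random.default_rng(seed)
        # three feed-forward layers; random init stands in for trained weights
        self.W1 = rng.normal(size=(dim, hidden)) * 0.02
        self.W2 = rng.normal(size=(hidden, hidden)) * 0.02
        self.W3 = rng.normal(size=(hidden, 1)) * 0.02

    def __call__(self, E):  # E: (n, dim) matrix of token embeddings
        h = relu(relu(E @ self.W1) @ self.W2)
        return sigmoid(h @ self.W3)[:, 0]  # one score per token

def scored_inputs(E, P, scores):
    # x_i = eps(t_i) * Phi(t_i) + pi(t_i): scale embeddings, keep positions
    return E * scores[:, None] + P

def score_regularizer(scores):
    # (1/n) * sum_i |Phi(t_i)|, encouraging scores below one
    return np.abs(scores).sum() / len(scores)
```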

Fine-Tuning and Inference.
The second step, referred to as fine-tuning, only trains the BERT model Ψ, while the scoring network Φ remains frozen for the remainder of the training process. Furthermore, the weights of the BERT model obtained in the previous step, θ_Ψ, are discarded and replaced by the same pre-trained model as before. The training objective during this stage is identical to that of other dual-encoder models (cf. Section 3.2.2).
During fine-tuning and inference (i.e., document encoding), we only retain the tokens with the highest scores; we set a ratio ρ ∈ [0, 1] of the original input length to retain. As a result, the length of the input batch is shortened by 1 − ρ. This is achieved by removing the lowest-scoring tokens from the input. Since individual documents within a batch are usually padded, n always corresponds to the longest sequence in the batch. Consequently, padding tokens are always removed first before the scores of the other tokens are taken into account. The process is illustrated in Fig. 5.
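The token selection step can be sketched as follows. This is a simplified illustration of the described behavior for a single sequence; the helper name and the exact padding handling are our assumptions.

```python
def select_tokens(tokens, scores, rho, pad="[PAD]"):
    """Shorten a (padded) input to a rho-fraction of its length.

    Padding tokens are always dropped first; after that, the lowest-scoring
    real tokens are removed. The original token order is preserved.
    """
    n_keep = int(round(rho * len(tokens)))
    # padding gets the lowest possible score so it is removed first
    effective = [float("-inf") if t == pad else s for t, s in zip(tokens, scores)]
    top = sorted(range(len(tokens)), key=lambda i: effective[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]
```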

EXPERIMENTAL SETUP
In this section, we outline the experimental setup, including baselines, datasets, and further details about training and evaluation.

Baselines
We consider the following baselines: (1) Sparse retrievers rely on term-based matching between queries and documents. We consider BM25, which uses term-based retrieval signals. DeepCT [12], SPLADE [18], and SpaDE [7] use sparse representations, but contextualize terms in some fashion. (2) Dense retrievers retrieve documents that are semantically similar to the query in a common embedding space. We consider TCT-ColBERT [52], ANCE [86], and the more recent Aggretriever [50]. All three approaches are based on BERT encoders. Large documents are split into passages before indexing (maxP). These dense retrievers use exact (brute-force) nearest neighbor search as opposed to approximate nearest neighbor (ANN) search. We evaluate these methods in both the retrieval and re-ranking setting. (3) Hybrid retrievers interpolate sparse and dense retriever scores. We consider CLEAR [23], a retrieval model that complements lexical models with semantic matching. Additionally, we consider the hybrid strategy described in Section 3.3 as a baseline, using the dense retrievers above. (4) Re-rankers operate on the documents retrieved by a sparse retriever (e.g., BM25). Each query-document pair is input into the re-ranker, which outputs a corresponding score. In this paper, we use a BERT-CLS re-ranker, where the output corresponding to the classification token is used as the score. Note that re-ranking is performed using the full documents (i.e., documents are not split into passages). If an input exceeds 512 tokens, it is truncated. Furthermore, we consider TILDEv2 [91] with TILDE expansion.

Datasets
We evaluate our models and baselines on a variety of diverse retrieval datasets: (1) The TREC Deep Learning track [9] provides test sets and relevance judgments for retrieval and ranking evaluation on the MS MARCO corpora [65]. We use both the passage and document ranking test sets from the years 2019 and 2020 for our experiments. In addition, we use the MS MARCO development sets to determine the optimal values for hyperparameters. (2) The BEIR benchmark [76] is a collection of various IR datasets, which are commonly evaluated in a zero-shot fashion, i.e., without using any of the data for training the model. We evaluate our models on a subset of the BEIR datasets, including tasks such as passage retrieval, question answering, and fact checking.

Evaluation Details
Our ranking experiments are performed on a single machine using an Intel Xeon Silver 4210 CPU and an NVIDIA Tesla V100 GPU. In our initial experiments (Tables 3 and 4), we measured the per-query latency by performing each experiment four times and reporting the average latency, excluding the first measurement.

Table 1. The pre-trained dense encoders and corresponding indexes we used in our experiments. In each cell, the first line corresponds to a pre-trained encoder (to be obtained from the HuggingFace Hub) and the second line is a pre-built index provided by Pyserini.

In subsequent experiments (Table 5), each run contains multiple latency measurements. We then report the average over all measurements of the fastest run. In Tables 3 and 4, latency is reported as the sum of scoring (this includes operations like encoding queries and documents, obtaining representations from a Fast-Forward index, computing the scores as dot products, and so on), interpolation (cf. Eq. (2)), and sorting cost. Any pre-processing or tokenization cost is ignored. Where applicable, dense models use a batch size of 256. The first-stage (sparse) retrieval step is not included, as it is constant for all methods. The Fast-Forward indexes are loaded into the main memory entirely before they are accessed. In Table 5, we report end-to-end latency, which includes retrieval, re-ranking, and tokenization cost.

We use the Pyserini [49] toolkit, which provides a number of pre-trained encoders (available on the HuggingFace Hub¹) and corresponding indexes (see Table 1), for our retrieval experiments. Dense encoders (ANCE, TCT-ColBERT, and Aggretriever) output 768-dimensional representations. The sparse BM25 retriever is provided by Pyserini as well. We use the pre-built indexes msmarco-passage (k1 = 0.82, b = 0.68) and msmarco-doc (k1 = 4.46, b = 0.82). Furthermore, we use Pyserini to run SPLADE with the provided msmarco-passage-distill-splade-max index and the pre-trained DistilSPLADE-max model.
We use the MS MARCO development set to determine the interpolation parameter α. We set α = 0.2 for TCT-ColBERT, α = 0.5 for ANCE, and α = 0.7 for BERT-CLS (Section 7.1). For Aggretriever, we set α = 0.3 for BM25 re-ranking and α = 0.1 for SPLADE re-ranking. For the dual-encoder models we trained ourselves (Sections 7.3 to 7.5), the value for α is determined based on nDCG@10 re-ranking results on the MS MARCO development set and varies slightly for each model.

Training Details
Our dual-encoder models are trained using the contrastive loss in Eq. (3). For each training instance, we sample 8 hard negative documents using BM25. Additionally, we use in-batch negatives and a batch size of 4, resulting in |D⁻| = 32 negatives for each query. Each model is trained on four NVIDIA A100 GPUs. We set the learning rate to 1 · 10⁻⁵ and use gradient accumulation of 32 batches (this results in an effective batch size of 4 · 4 · 32 = 512). During training, we perform validation on the MS MARCO development set. Our models are trained until the average precision stops improving for five consecutive iterations. We exclusively train on the MS MARCO passage ranking corpus; the resulting models are then evaluated on multiple datasets (i.e., for BEIR, we do zero-shot evaluation).
Our Selective BERT model (cf. Section 5.2) uses λ = 10⁻⁶ during pre-training. We implemented our models and training pipeline using PyTorch,² PyTorch Lightning,³ and Transformers.⁴

6.4.1 Dual-Encoder Architecture.
Our dual-encoder rankers consist of a query encoder and a document encoder (cf. Section 3.2). The models ζ and η map queries and documents to arbitrary vector representations; examples of such models are pre-trained Transformers or the encoders described in Section 5. We include optional trainable linear layers (with corresponding weights W_q, W_d and biases b_q, b_d) for heterogeneous encoders, where the dimensions of the representation vectors, d_q and d_d, do not match. We further L2-normalize the representations during training and indexing; we do not normalize the query representations during ranking, as this would only scale the scores, but not change the final ranking.
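A minimal sketch of this architecture follows. All names are ours, and the toy zeta and eta functions stand in for the actual encoder models; the optional linear layer aligns mismatched dimensions for heterogeneous encoders.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

class DualEncoder:
    """Dual-encoder ranker with an optional projection for heterogeneous encoders."""

    def __init__(self, zeta, eta, W_q=None, b_q=None):
        self.zeta, self.eta = zeta, eta  # query / document models
        self.W_q, self.b_q = W_q, b_q    # optional linear layer (d_q -> d_d)

    def encode_query(self, q):
        v = np.asarray(self.zeta(q), dtype=float)
        if self.W_q is not None:
            v = v @ self.W_q + self.b_q
        return v  # not normalized at ranking time: it would only scale all scores

    def encode_document(self, d):
        # documents are L2-normalized before being stored in the index
        return l2_normalize(np.asarray(self.eta(d), dtype=float))

    def score(self, q, d):
        return float(self.encode_query(q) @ self.encode_document(d))
```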

RESULTS
In this section, we perform experiments to show the effectiveness and efficiency of Fast-Forward indexes.Each subsection corresponds to one of our research questions.

How suitable are dual-encoder models for interpolation-based re-ranking in terms of performance and efficiency?
This section focuses on the effectiveness and efficiency of Fast-Forward indexes for re-ranking. We use pre-trained dual-encoders that are homogeneous (i.e., both encoders are identical models) for our experiments. In Table 2, we report the performance of sparse, dense, and hybrid retrievers, re-rankers, and interpolation.
First, we observe that dense retrieval strategies perform better than sparse ones in terms of nDCG, but have poor recall except on TREC-DL-Psg'19. The contextual weights learned by DeepCT are better than tf-idf-based retrieval (BM25), but fall short of dense semantic retrieval strategies (TCT-ColBERT and ANCE), with differences upwards of 0.1 in nDCG. However, the overlap among retrieved documents is rather low, reflecting that dense retrieval cannot match query and document terms well.
Second, dual-encoder-based re-rankers (TCT-ColBERT and ANCE) perform better than the contextual (BERT-CLS) re-ranker. In this setup, we first retrieve k_S = 1000 documents using a sparse retriever and re-rank them. This approach benefits from high recall in the first stage and promotes the relevant documents to the top of the list through the dense semantic re-ranker. However, re-ranking is typically time-consuming and requires GPU acceleration. The improvements of TCT-ColBERT and ANCE over BERT-CLS (e.g., 0.1 in nDCG) also suggest that dual-encoder-based re-ranking strategies are better than cross-interaction-based methods. However, the difference could also be attributed to the fact that BERT-CLS does not follow the maxP approach (cf. Section 3.1).
Finally, interpolation-based re-ranking, which combines the benefits of sparse and dense scores, significantly outperforms the BERT-CLS re-ranker and dense retrievers. Recall that dense re-rankers operate solely based on the dense scores and discard the sparse BM25 scores of the query-document pairs [22, 23]. The superiority of interpolation-based methods is also supported by evidence from recent studies [5, 6, 22, 23].

Table 2. Superscripts indicate statistically significant improvements using two-paired tests with a significance level of 95% [19].

Efficient Re-Ranking at Higher Retrieval Depths.
Tables 3 and 4 show the results of re-ranking, hybrid retrieval, and interpolation on document and passage datasets, respectively. The metrics are computed for two sparse retrieval depths, k_S = 1000 and k_S = 5000. We observe that additionally taking the sparse component into account in the score computation (as is done by the interpolation and hybrid methods) causes performance to improve with retrieval depth. Specifically, some queries receive a considerable recall boost, capturing more relevant documents with large retrieval depths. Interpolation based on Fast-Forward indexes achieves substantially lower latency compared to other methods. Pre-computing the document representations allows for fast look-ups during retrieval time. As only the query needs to be encoded by the dense model, both retrieval and re-ranking can be performed on the CPU while still offering considerable improvements in query processing time. Note that, for BERT-CLS, the input length is limited, causing documents to be truncated, similarly to the firstP approach. As a result, the latency is much lower, but in turn the performance suffers. It is important to note here that, in principle, Fast-Forward indexes can also be used in combination with firstP models.
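With pre-computed document representations, the re-ranking step itself reduces to index look-ups and a weighted sum. A minimal sketch under our assumptions (names are ours; the forward index is represented as a plain dictionary):

```python
def fast_forward_rerank(sparse_ranking, index, encode_query, query, alpha):
    """Interpolation-based re-ranking via look-ups in a pre-computed index.

    sparse_ranking: list of (doc_id, sparse score) from the first stage
    index:          dict doc_id -> pre-computed dense document vector
    encode_query:   the only model invocation needed at query time
    """
    q = encode_query(query)
    rescored = [
        (alpha * s + (1 - alpha) * sum(qi * di for qi, di in zip(q, index[doc])), doc)
        for doc, s in sparse_ranking
    ]
    return sorted(rescored, reverse=True)
```

Since no document is encoded at query time, this loop runs comfortably on a CPU.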
The hybrid retrieval strategy, as described in Section 3.3, shows good performance.However, as the dense indexes require nearest neighbor search for retrieval, the query processing latency is much higher than for interpolation using Fast-Forward indexes.
Finally, dense re-rankers do not profit reliably from increased sparse retrieval depth; on the contrary, the performance drops in some cases. This trend is more apparent for the document retrieval datasets with higher values of k_S. We hypothesize that dense rankers only focus on semantic matching and are sensitive to topic drift, causing them to rank irrelevant documents in the top-5000 higher.

Table 3. Document ranking performance. Latency is reported per query for k_S = 5000 on GPU and CPU. The coalesced Fast-Forward indexes are compressed to approximately 25% of their original size. Hybrid retrievers use a dense retrieval depth of k_D = 1000. Superscripts indicate statistically significant improvements using two-paired tests with a significance level of 95% [19].

Table 5. Passage ranking performance using various first-stage retrieval models as well as re-rankers. Aggretriever models are used for interpolation-based re-ranking using Fast-Forward indexes. Re-ranking is done with k_S = 5000 passages. SpaDE results are taken from the corresponding paper [7]. For SPLADE, we use the DistilSPLADE-max model. Latency is reported per query on CPU. For retrieval models (BM25 and SPLADE), latency is reported at retrieval depth k_S = 1000. For re-ranking (TILDEv2 and Fast-Forward), latency is reported as the sum of retrieval and re-ranking, both at depth k_S = 5000.

Latency.
End-to-end latency results are shown in Table 5, where we compare various first-stage retrieval methods in combination with re-rankers. The idea is to show how Fast-Forward indexes perform in combination with modern sparse retrievers and how they compare with other re-rankers. Additionally, these experiments give an idea of the end-to-end efficiency, as we report the latency as the sum of retrieval, re-ranking, and tokenization. The Aggretriever model [50] we use in combination with Fast-Forward indexes is a recent single-vector dual-encoder model based on coCondenser [21].
Both SpaDE and SPLADE, unsurprisingly, perform substantially better than BM25, as these models use contextualized learned representations. This boost in performance comes with a large increase in latency, in terms of both indexing and query processing. However, it becomes evident that re-ranking BM25 results comes very close to these models in terms of performance, and sometimes even surpasses them, even though the overall latency remains lower. At the same time, Fast-Forward indexes manage to improve the performance of SPLADE by re-ranking (although the improvements are smaller). Interestingly, TILDEv2 does not exhibit this behavior, but rather performs worse when a SPLADE first-stage retriever is used. We assume that the reason for this is that the model was not optimized for this scenario.
7.2 Can the re-ranking efficiency be improved by limiting the number of Fast-Forward look-ups?
We evaluate the utility of the early stopping approach described in Section 4.2 on the TREC-DL-Psg'19 dataset. Figure 6 shows the average number of look-ups performed in the Fast-Forward index during interpolation w.r.t. the cut-off depth k. We observe that, for k = 100, early stopping already leads to a reduction of almost 20% in the number of look-ups. Decreasing k further leads to a significant reduction of look-ups, resulting in improved query processing latency. As lower cut-off depths (i.e., k < 100) are typically used in downstream tasks, such as question answering, the early stopping approach for low values of k turns out to be particularly helpful.
Table 4 shows early stopping applied to the passage dataset to retrieve the top-10 passages and compute reciprocal rank. It is evident that, even though the algorithm approximates the maximum dense score (cf. Section 4.2), the resulting performance is identical, which means that the approximation was accurate in both cases and did not incur any performance hit. Furthermore, the query processing time is decreased by up to a half compared to standard interpolation. This means that presenting a small number of top results (as is common in many downstream tasks) can yield substantial speed-ups. Note that early stopping depends on the value of α, hence the latency varies between TCT-ColBERT and ANCE.

Fig. 7. Query encoding latency and Fast-Forward ranking performance of dual-encoders with various query encoder models, including nDCG@10 on TREC-DL-Psg'20 (panel d). The sparse retrieval depth is k_S = 5000. L and d correspond to the number of Transformer layers and dimensions of the hidden representations, respectively. L = 0 corresponds to embedding-based query encoders, which are initialized with pre-trained token embeddings from BERT base, and L > 0 corresponds to attention-based query encoders, where the number of attention heads is h = d/64. The document encoder is a BERT model with 12 layers and 768-dimensional representations in all cases. Query encoding latency is measured on CPU with a batch size of 256 queries from MSM-Psg-Dev (tokenization cost is excluded, as it is identical for all models).

To what extent does query encoder complexity affect re-ranking performance?
In this section, we investigate the role of the query encoder in interpolation-based re-ranking using Fast-Forward indexes.

7.3.1 The Role of Self-Attention.
First, we train a large number of dual-encoder models (as described in Section 6.4) and successively reduce the complexity of the query encoder. At the same time, we monitor the effects on performance and latency. The query encoders we analyze correspond to the attention-based query encoders in Section 5.1.1 and the embedding-based query encoders in Section 5.1.2. Since the embedding-based encoders are, technically speaking, a special case of the attention-based ones, we plot the results together in Fig. 7. The document encoder we use is a BERT base model, which has L = 12 layers and d = 768 hidden dimensions; it is the same across all experiments. For the query encoder, we start with BERT base as well and reduce both the number of layers and hidden dimensions. All pre-trained BERT models we use for this experiment are provided by Turc et al. [78]. If the output dimensions of the encoders do not match, we add a single linear layer to the query encoder (cf. Section 6.4.1).
Figure 7a illustrates the time each encoder requires to encode a batch of queries on a CPU; as expected, a reduction in either the number of layers or hidden dimensions has a positive impact on encoding latency, and the most lightweight attention-based model (L = 2, d = 128) is significantly faster than BERT base (27 milliseconds vs. 3.1 seconds). Furthermore, the complete omission of self-attention in the embedding-based encoder (L = 0, d = 768) results in even faster encoding (13 milliseconds).
Next, we analyze to what extent the drastic reduction of complexity affects the ranking performance. Figures 7b to 7d show the corresponding Fast-Forward re-ranking performance on passage development and test sets. It is evident that the absolute difference in performance between the encoders is relatively low; this is especially true on MSM-Psg-Dev and TREC-DL-Psg'19. In fact, the embedding-based query encoder does not always yield worse performance than the attention-based encoders, specifically on TREC-DL-Psg'19. On TREC-DL-Psg'20, the absolute difference is the largest among the three datasets at 0.05.
These results suggest that query encoders do not need to be overly complex; rather, in most cases, either considerably smaller attention-based or even embedding-based models can be used. The embedding-based encoders are particularly useful, since they are essentially a look-up table and hence require no forward pass other than computing the average of all token embeddings.

7.4 What is the trade-off between Fast-Forward index size and ranking performance?
This research question investigates how index size influences ranking performance and latency. In detail, we reduce the index size in two different ways: First, we apply sequential coalescing (cf. Section 4.1) in order to reduce the number of vector representations in the index. Second, we train query and document encoders to output lower-dimensional vector representations. Note that these methods are not mutually exclusive, but rather complementary.

7.4.1 Sequential Coalescing.
In order to evaluate this approach, we first take the pre-trained TCT-ColBERT dense index of the MS MARCO corpus, apply sequential coalescing with varying values for the threshold δ, and evaluate each resulting compressed index using the TREC-DL-Doc'19 test set. The results are illustrated in Fig. 8. It is evident that, by combining the passage representations, the number of vectors in the index can be reduced by more than 80% in the most extreme case, where only a single vector per document remains. At the same time, the performance is correlated with the granularity of the representations. However, the drops are relatively small. For example, for δ = 0.025, the index size is reduced by more than half, while the nDCG decreases by roughly 0.015 (3%).
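Since the details of sequential coalescing (Section 4.1) are not reproduced here, the following is only a sketch of the idea under our assumptions: consecutive passage vectors of a document are merged into their running average, and a new group is started whenever the next vector's cosine distance to that average exceeds a threshold delta.

```python
import numpy as np

def sequential_coalescing(passage_vectors, delta):
    """Merge consecutive passage vectors of one document into averaged vectors.

    Sketch under our assumptions: a passage starts a new group when its cosine
    distance to the running average of the current group exceeds delta; each
    finished group is stored in the index as a single averaged vector.
    """
    merged, group = [], []
    for v in passage_vectors:
        if group:
            avg = np.mean(group, axis=0)
            cos = avg @ v / (np.linalg.norm(avg) * np.linalg.norm(v))
            if 1.0 - cos > delta:  # too dissimilar: close the current group
                merged.append(avg)
                group = []
        group.append(v)
    merged.append(np.mean(group, axis=0))
    return merged
```

A small delta merges only near-duplicate passages; a large delta collapses each document toward a single vector, trading ranking quality for index size.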
Additionally, Table 3 shows the detailed performance of coalesced Fast-Forward indexes on the document datasets. We chose the indexes corresponding to δ = 0.035 (TCT-ColBERT) and δ = 0.003 (ANCE), both of which are compressed to approximately 25% of their original size. This is reflected in the query processing latency, which is reduced by more than half. The overall performance drops to some extent, as expected; however, these drops are not statistically significant in all but one case. The trade-off between latency (index size) and performance can be controlled by varying the threshold δ.

7.4.2 Dimensionality of Representations.
The second way of reducing the index size is to lower the dimensionality of the representations output by the models. The idea is motivated by recent research [66], which suggests that the representation vectors are not the bottleneck of dual-encoder models, but rather the document encoder complexity is. Since the dimensionality of the representations directly influences the index size, it is desirable to keep it as low as possible.
In order to analyze the effect, we train a number of dual-encoder models (cf. Section 3.2.2), where all hyperparameters except the hidden dimension d and the number of attention heads h are kept the same. We show results for embedding-based (L = 0) and attention-based (L = 12) query encoders in Fig. 9. There is a trade-off between the dimensionality of representations and ranking performance, which is expected; this trade-off is exhibited by both embedding-based and attention-based query encoders. Overall, the results show that the performance reduction is rather small for d = 512 and even d = 256 (compared to d = 768), considering that it goes hand in hand with a reduction in index size of approximately 33% and 67%, respectively.

Fig. 10. We vary ρ ∈ [0, 1] during the indexing stage, resulting in progressively higher indexing efficiency (Fig. 10a). The corresponding Fast-Forward ranking performance on MSM-Psg-Dev is shown in Fig. 10b for an embedding-based query encoder (L = 0) and in Fig. 10c for an attention-based query encoder (L = 12). Document encoding latency is measured on GPU with a batch size of 256 passages from the MS MARCO corpus (tokenization cost is excluded, as it is identical for all models).

Can the indexing efficiency be improved by removing irrelevant document tokens?
In this experiment, we focus on the Selective BERT document encoders proposed in Section 5.2. In order to analyze the indexing efficiency and ranking performance, we train two dual-encoders (cf. Section 6.4) with Selective BERT document encoders, where L = 12 and d = 768. The query encoders have L = 0 (embedding-based) and L = 12 (attention-based), respectively, and d = 768. During fine-tuning (cf. Section 5.2.2), we fix the hyperparameter ρ = 0.75, which controls the ratio of document tokens to be retained; afterwards, we create a number of indexes, where we vary ρ between 0.1 and 0.9, and compute the corresponding indexing time (using GPUs) and ranking performance. The results are plotted in Fig. 10.
The document encoding latency (Fig. 10a) increases nearly linearly with the ratio of tokens to keep (ρ). Even though the BERT model has a quadratic complexity w.r.t. its input length, this is expected, as there is a certain amount of overhead introduced by the scoring network and the reconstruction of the batches. More interestingly, the ranking performance (Figs. 10b and 10c) is mostly unchanged for ρ ≥ 0.5 in both cases; however, neither model manages to match the performance of its respective baseline (the same configuration with a standard BERT model instead of Selective BERT). We hypothesize that the reason for this could be the choice of ρ = 0.75 during the fine-tuning step.
Overall, our results show that up to 50% of document tokens can be removed without much of a performance reduction. Encoding half the number of tokens results in approximately halving the time required to encode documents. This has a large impact on efficient index maintenance in the context of dynamically growing document collections. For future work, the Selective BERT architecture can be further refined, for example, by introducing improved (contextualized) scoring networks.

DISCUSSION
In this section, we reflect upon our work and present possible limitations.

Table 6. Retrieval results of dual-encoder models using lightweight query encoders and some baselines. For TREC-DL-Doc'19, the dense retrieval depth is set to k_D = 10000 and maxP aggregation is applied (cf. Eq. (1)). Our model with L = 0 uses an embedding-based query encoder, and the one with L = 12 uses an attention-based query encoder. The document encoder is a BERT base model (L = 12, d = 768) in both cases.

Efficient Encoders for Dense Retrieval
Our research questions and experiments have focused exclusively on interpolation-based re-ranking using dual-encoders and Fast-Forward indexes. However, the most common application of dual-encoders in the field of IR is their use as dense retrieval models; a natural question is whether the encoders proposed in Section 5 can also be used for more efficient dense retrieval.
In Table 6, we present passage and document retrieval results on the MS MARCO corpus. Dense retrievers use a FAISS [32] vector index; no interpolation or re-ranking is performed. It is immediately obvious that our models do not achieve competitive results; the embedding-based encoder yields far worse performance than dense retrievers and even BM25, and even the attention-based encoder fails to improve over sparse retrieval.
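For reference, the exact (brute-force) retrieval performed by a flat vector index can be sketched in a few lines. The helper below is a simplified numpy stand-in for what an exact index such as FAISS' IndexFlatIP computes, not the actual library code:

```python
import numpy as np

def exact_ip_search(doc_vecs, query_vec, k):
    """Brute-force maximum-inner-product search, as performed by a
    flat (exact) vector index such as FAISS' IndexFlatIP."""
    scores = doc_vecs @ query_vec          # inner product with every document
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    return top, scores[top]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-normalize
q = docs[42] + 0.01 * rng.normal(size=64)             # query close to doc 42
q /= np.linalg.norm(q)
ids, scores = exact_ip_search(docs, q, k=5)
```

The linear scan over all documents is exactly why dense retrieval over large corpora requires dedicated (often approximate) index structures.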
From these results, we infer that the models we trained are not suitable for dense retrieval. However, we assume that the main reason for this is not the architecture of the query encoder, but rather the following:
• We use a simple in-batch negative sampling strategy [34], which has been shown to be less effective than more involved strategies [51, 53, 86, 88].
• The hardware we use for training the models limits the batch size and thus the number of negative samples, i.e., we cannot use a batch size greater than 4.
• We perform validation and early stopping based on re-ranking.
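The simple in-batch negative sampling strategy mentioned above can be sketched as follows. This is a generic InfoNCE-style formulation; the function name and the plain dot-product similarity are illustrative assumptions:

```python
import numpy as np

def in_batch_contrastive_loss(q_reps, d_reps):
    """Contrastive loss with in-batch negatives: for query i, document i is
    the positive and every other document in the batch is a negative.
    Returns the mean cross-entropy over the row-wise softmax of the
    query-document similarity matrix (positives on the diagonal)."""
    sims = q_reps @ d_reps.T                      # (B, B) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)       # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# with a batch size of 4 (as in our setup), each query sees only 3 negatives
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
loss = in_batch_contrastive_loss(q, q.copy())     # identical reps -> low loss
```

The small batch size directly caps the number of negatives per query, which is one plausible reason for the weak dense retrieval results.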
Considering the points above, we expect that our dual-encoder models, including those with lightweight encoders, could also be used in retrieval settings if the shortcomings of the training setup are addressed, for example by using more powerful hardware and state-of-the-art training approaches. On the other hand, we argue that the fact that our models perform well in the re-ranking setting (see Section 7) shows that it is both easier and more efficient (in terms of time and resources) to train models for use with Fast-Forward indexes than for dense retrieval.

Out-of-Domain Performance
In the previous sections, we found that Fast-Forward indexes and lightweight query encoders show good performance on in-domain ranking tasks. This raises the question of whether the models generalize well to out-of-domain tasks. In order to ascertain the out-of-domain capabilities of our models, we evaluate them on a number of test sets from the BEIR benchmark. The evaluation is performed in a zero-shot fashion, meaning that we use the same models as before and do not re-train them on the respective datasets. The results are shown in Table 7. It is apparent that the attention-based query encoder yields better results than the embedding-based one in all cases, but the difference varies across datasets. Since both models were trained on MS MARCO, they perform well on the BEIR version of that dataset, as expected; notable differences in performance are observed on Fever and DBpedia-Entity, however, where both models still manage to improve over the BM25 results. Finally, on Quora, SciFact, and NFCorpus, re-ranking fails to improve or even degrades the results. We assume that the corresponding tasks either require specific in-domain knowledge or would benefit greatly from query-document attention (cross-attention).

Threats to Validity
In this section, we outline and discuss aspects of the experimental evaluation in this article that pose possible threats to the validity of the results.

8.3.1 Performance of BERT-CLS. In Tables 3 and 4, we report the performance of dual-encoder ranking models along with a cross-attention model (BERT-CLS). We found that BERT-CLS performed notably worse, especially when the sparse retrieval depth is increased. This result is unexpected, especially considering that the cross-attention architecture allows for query-document attention.
In addition to the architecture itself, the models differ in the way they are trained: ANCE and TCT-ColBERT use complex distillation and negative sampling approaches, along with contrastive loss functions (cf. Eq. (3)), while BERT-CLS is trained using a simple pairwise loss. It is thus reasonable to assume that the negative sampling approach has a positive impact on the performance. Specifically, the contrastive loss trains the models to identify relevant documents among a very large number of irrelevant documents, while the pairwise loss focuses on re-ranking mostly related documents, which could explain the performance drop at higher retrieval depths.
Furthermore, it is important to note that, even if BERT-CLS performed similarly to the dual-encoder models, the difference in efficiency would remain the same, leaving the claims we make unaffected.

8.3.2 Latency Measurements. As Fast-Forward indexes aim at improving ranking efficiency, we mainly focus on the query processing latency, which is reported in Tables 3 to 5 and Fig. 7. As the experiments in this paper were performed over a longer period of time, there have been slight changes with respect to, for example, hardware or implementations. Consequently, the latency numbers might not be directly comparable across experiments. Thus, we made sure that each experiment is self-contained, such that these comparisons are not necessary; rather, our results highlight relative latency improvements within each experiment, where all measurements are comparable. In general, one should also keep in mind that latency can be heavily influenced by the way a method is implemented.

8.3.3 Hybrid Retrieval Baselines. In Tables 3 and 4, we presented, along with the results of our own method, some hybrid retrieval baselines. Table 1 shows the corresponding indexes that we used for the dense retrievers. It is important to note that these are brute-force indexes, i.e., they perform exact nearest-neighbor retrieval. It is thus to be expected that the latency of hybrid retrieval can be further reduced by employing approximate dense retrieval instead; however, this would likely come with a reduction in ranking performance.

CONCLUSION
In this paper, we proposed Fast-Forward indexes, a simple yet effective and efficient look-up-based interpolation method that combines lexical and semantic ranking. Fast-Forward indexes are based on dense dual-encoder models, exploiting the fact that document representations can be pre-computed and stored, providing efficient access in constant time. Using interpolation, we observed increased performance compared to hybrid retrieval. Furthermore, we achieved improvements of up to 75% in memory footprint and query processing latency due to our optimization techniques, sequential coalescing and early stopping.
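As a minimal illustration of the look-up-based interpolation, the sketch below uses a plain dictionary as the forward index; the function names and data structures are simplifications, not our actual implementation:

```python
def interpolate(sparse_scores, fwd_index, query_rep, alpha=0.5):
    """Re-rank a sparse candidate list by interpolating lexical and semantic
    scores.  `fwd_index` maps a document ID to its pre-computed dense
    representation, so the semantic score is a single dot product per
    candidate -- no document encoding happens at query time."""
    results = {}
    for doc_id, s_sparse in sparse_scores.items():
        d_rep = fwd_index[doc_id]                      # constant-time look-up
        s_dense = sum(q * d for q, d in zip(query_rep, d_rep))
        results[doc_id] = alpha * s_sparse + (1 - alpha) * s_dense
    return dict(sorted(results.items(), key=lambda kv: -kv[1]))

# toy example: two documents with 2-dimensional representations
index = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
ranked = interpolate({"d1": 0.2, "d2": 0.4}, index, query_rep=[1.0, 0.0])
```

In the toy example, d2 ranks higher lexically, but the interpolation with the dense score promotes d1 to the top.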
Moreover, we introduced efficient encoders for dual-encoder models: embedding-based and lightweight attention-based query encoders can be used to compute query representations significantly faster without compromising performance too much. Selective BERT document encoders dynamically remove irrelevant tokens from input documents prior to indexing, reducing the document encoding latency by up to 50% and thus making index maintenance much faster.
Our method requires only CPU computations for ranking, eliminating the need for expensive GPU-accelerated re-ranking.

Fig. 1. Sequential coalescing combines the representations of similar consecutive passages into their average. Note that the third and fifth passages are not combined, as they are not consecutive.
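A simplified sketch of sequential coalescing, assuming a cosine-similarity criterion against the running average; the threshold value and the function signature are illustrative, not the exact algorithm from the paper:

```python
import numpy as np

def sequential_coalescing(passage_reps, threshold=0.9):
    """Merge consecutive passage representations of a document: while the
    next passage is similar enough (cosine) to the running average, it is
    absorbed; otherwise the current average is emitted and a new group is
    started.  Only consecutive passages are merged, preserving order."""
    combined, current = [], [passage_reps[0]]
    for rep in passage_reps[1:]:
        avg = np.mean(current, axis=0)
        cos = rep @ avg / (np.linalg.norm(rep) * np.linalg.norm(avg))
        if cos >= threshold:
            current.append(rep)
        else:
            combined.append(np.mean(current, axis=0))
            current = [rep]
    combined.append(np.mean(current, axis=0))
    return combined

# passages 1-2 are similar; passage 4 resembles passage 1 but is not adjacent
reps = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [1.0, 0.0]])
merged = sequential_coalescing(reps, threshold=0.9)
```

In the toy example, the first two passages collapse into one vector, while the last passage stays separate despite resembling the first, mirroring the consecutiveness constraint in Fig. 1.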

Theorem 1. Let the maximum dense score used in Algorithm 2 be the true maximum of the dense scores. Then the returned scores are the actual top-𝑘 scores.
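A sketch of how this guarantee enables early stopping (cf. Algorithm 2): since the sparse list is consumed in descending score order, the maximum dense score bounds the best possible interpolated score of any unseen document. The helper below is illustrative; the actual data structures differ:

```python
def topk_early_stopping(sparse_ranking, dense_score, s_max, k, alpha=0.5):
    """Interpolation with early stopping: `sparse_ranking` is a list of
    (doc_id, score) pairs in descending sparse-score order, `dense_score`
    looks up a document's dense score, and `s_max` is the maximum dense
    score.  Once no remaining document can enter the current top-k, the
    loop stops and further index look-ups are avoided."""
    results = {}
    for doc_id, s_sparse in sparse_ranking:
        if len(results) >= k:
            # upper bound on any remaining document's interpolated score
            best_possible = alpha * s_sparse + (1 - alpha) * s_max
            if best_possible <= min(results.values()):
                break                                  # safe to stop
        results[doc_id] = alpha * s_sparse + (1 - alpha) * dense_score(doc_id)
        if len(results) > k:
            results.pop(min(results, key=results.get))
    return sorted(results.items(), key=lambda kv: -kv[1])

dense = {"a": 0.9, "b": 0.1, "c": 0.95, "d": 0.2}
ranked = topk_early_stopping(
    [("a", 1.0), ("c", 0.8), ("b", 0.6), ("d", 0.1)],
    dense_score=dense.__getitem__, s_max=max(dense.values()), k=2)
```

In the toy example, the loop stops after two look-ups, yet the returned top-2 equals the exhaustively computed one, consistent with the theorem.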

Fig. 3. The distribution of query and passage lengths in the MS MARCO corpus. The statistics are computed based on the development set queries and the first 10 000 passages from the corpus using a BERT base tokenizer.

Fig. 4. The query encoder types used in this work. Note that the positional encoding added to BERT input tokens is omitted in this figure.

5.1.2 Embedding-based. Embedding-based query encoders can be seen as a special case of BERT-based query encoders (cf. Section 5.1.1). Setting the number of Transformer encoder layers to zero, all that remains is the token embedding layer.
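A minimal sketch of such an encoder: with no Transformer layers, the query representation reduces to an aggregate of the individual token embeddings. The average-pooling choice, names, and sizes here are illustrative assumptions:

```python
import numpy as np

def embedding_query_encoder(token_ids, embedding_table):
    """An embedding-based query encoder: with zero Transformer layers, the
    query representation is simply an aggregate (here: the average) of the
    token embeddings -- no self-attention is computed, so encoding a query
    is a handful of table look-ups."""
    return embedding_table[np.asarray(token_ids)].mean(axis=0)

rng = np.random.default_rng(0)
emb = rng.normal(size=(30522, 8))   # vocabulary-sized embedding table
q_rep = embedding_query_encoder([2054, 2003, 1037], emb)
```

Because no attention layers are evaluated, query encoding cost becomes negligible compared to a full BERT forward pass.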

Fig. 5. The fine-tuning and inference phases of Selective BERT document encoders. In the given example, the documents in the input batch are dynamically shortened to four tokens each based on the corresponding relevance scores. Note that the positional encoding added to BERT input tokens is omitted in this figure.

Fig. 6. The average number of Fast-Forward index look-ups per query for interpolation with early stopping at varying cut-off depths on TREC-DL-Psg'19 with a sparse retrieval depth of 5000 using ANCE.

7.1.3 Varying the First-Stage Retrieval Model. We perform additional passage ranking experiments in Table

Fig. 8. Sequential coalescing applied to TREC-DL-Doc'19. The plot shows the index size reduction in terms of the number of passages and the corresponding metric values for Fast-Forward interpolation with TCT-ColBERT.

Fig. 10. Evaluation of Fast-Forward indexes created using Selective BERT models. The document encoders are BERT base models (12 layers, hidden size 768). During fine-tuning, we set the ratio of tokens to keep to 0.75. We then vary this ratio in [0, 1] during the indexing stage, resulting in progressively higher indexing efficiency (Fig. 10a). The corresponding Fast-Forward ranking performance on MSM-Psg-Dev is shown in Fig. 10b for an embedding-based query encoder (zero layers) and in Fig. 10c for an attention-based query encoder (12 layers). Document encoding latency is measured on GPU with a batch size of 256 passages from the MS MARCO corpus (tokenization cost is excluded, as it is identical for all models).

For the latency measurements (Table 5 and Figs. 7a and 10a), we adjusted our way of measuring; we perform multiple runs of each experiment.

Table 2. Ranking performance. Retrievers use retrieval depths of 1000 (sparse) and 10000 (dense). Dense retrievers retrieve passages and perform maxP aggregation for documents. Scores for CLEAR and DEEP-CT are taken from the corresponding papers.