Optimizing Guided Traversal for Fast Learned Sparse Retrieval

Recent studies show that BM25-driven dynamic index skipping can greatly accelerate MaxScore-based document retrieval over the learned sparse representation derived by DeepImpact. This paper investigates the effectiveness of such a traversal guidance strategy during top-k retrieval with other models such as SPLADE and uniCOIL, and finds that unconstrained BM25-driven skipping can cause visible relevance degradation when the BM25 model is not well aligned with the learned weight model or when the retrieval depth k is small. This paper generalizes the previous work and optimizes BM25-guided index traversal with a two-level pruning control scheme and model alignment for fast retrieval with a sparse representation. Although the proposed scheme can incur some extra latency, it remains much faster than the original MaxScore method without BM25 guidance while retaining relevance effectiveness. This paper analyzes the competitiveness of this two-level pruning scheme and evaluates its tradeoff between ranking relevance and time efficiency on several test datasets.


INTRODUCTION
Document retrieval for searching a large dataset often uses a sparse representation of document feature vectors implemented as an inverted index, which associates each search term with a list of documents containing that term. Recently, learned sparse representations have been developed that compute term weights using a neural model such as a transformer-based retriever [1,9,12,14,24,30] and deliver strong relevance results, together with document expansion (e.g. [5]). A downside is that top-k document retrieval latency using a learned sparse representation is much larger than with the BM25 model, as discussed in [29,30]. In traditional BM25-based document retrieval with additive ranking, a dynamic index pruning strategy based on the top-k threshold is very effective: it computes a rank score upper bound on the fly for each visited document during index traversal in order to skip low-scoring documents that cannot appear in the final top-k list. Well-known traversal algorithms with such dynamic pruning strategies include MaxScore [41] and WAND [2], and their block-based versions Block-Max WAND (BMW) [11] and Block-Max MaxScore (BMM) [4,10].
Mallia et al. [31] propose to use BM25 to guide traversal, called GT, for fast learned sparse retrieval, because the distribution of learned weights leaves fewer pruning opportunities; they conducted an evaluation with the retrieval model DeepImpact [30]. One variation they propose is to compute the final rank score as a linear combination of the learned weights and BM25 weights, denoted as GTI. GT is a special case of GTI, and this paper treats GTI as the main baseline. Since the BM25 weight for a document-term pair may not exist in a learned sparse index, zero filling is used in Mallia et al. [31] to align the BM25 and learned weight models. During our evaluation using GT for SPLADE v2 and its revision SPLADE++ [12,13], we find that as the retrieval depth k decreases, BM25-driven skipping becomes too aggressive in dropping documents desired by top-k ranking based on learned term weights, which can cause a significant relevance degradation. In addition, there is still room to further improve the index alignment of GTI for more accurate BM25-driven pruning.
To address the above issues, we improve our earlier pruning study on dual guidance with combined BM25 and learned weights [36]. Our work generalizes GTI by constraining the pruning influence of BM25 and providing an alternative smoothing method to align the BM25 index with learned weights. In Section 4, we propose a two-level parameterized guidance scheme with index alignment, called 2GTI, to manage pruning decisions during MaxScore-based traversal. We analyze some formal properties of 2GTI regarding its relevance behavior and the configuration conditions under which 2GTI outperforms a two-stage top-k search algorithm for a query in relevance.
Section 5 and Appendix A present an evaluation of 2GTI with SPLADE++ [12-14] and uniCOIL [15,24], in addition to DeepImpact [30], when using MaxScore on the MS MARCO datasets. This evaluation shows that when the retrieval depth k is small, or when the BM25 index is not well aligned with the underlying learned sparse representation, 2GTI can outperform GTI and retain relevance more effectively. In some cases there is a tradeoff: 2GTI-based retrieval may be slower than GTI while still much faster than the original MaxScore method without BM25 guidance. 2GTI is also effective for the BEIR datasets in terms of zero-shot relevance and retrieval latency. In Appendix B, we extend the use of 2GTI to a BMW-based algorithm such as VBMW [32]. We demonstrate that 2GTI with VBMW can be useful for a class of short queries and when k is small.

BACKGROUND AND RELATED WORK
The top-k document retrieval problem identifies the top ranked results in matching a query. A document representation uses a feature vector to capture the semantics of a document. If these vectors contain many more zero than non-zero entries, such a representation is considered sparse. For a large dataset, document retrieval often uses a simple additive formula as the first stage of search, computing the rank score of each document d as

RankScore(d) = Σ_{t∈Q} w(t, d) · q_t,   (1)

where Q is the set of all search terms, w(t, d) is the weight contribution of term t in document d, and q_t is a document-independent or query-specific term weight. Assuming that w(t, d) can be statically or dynamically scaled, this paper takes q_t = 1 for simplicity of presentation. An example of such a formula is BM25 [18], which is widely used. For a sparse representation, a retrieval algorithm often uses an inverted index with a set of terms and a document posting list for each term. A posting record in this list contains a document ID and its weight for the corresponding term.
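For concreteness, the additive formula above can be sketched as an exhaustive term-at-a-time scorer over a toy inverted index. This is an illustrative sketch only; the terms and weights are made up, and this is not the retrieval code evaluated in this paper.

```python
from collections import defaultdict

# Toy inverted index: term -> list of (doc_id, weight) postings,
# sorted by doc_id. Term names and weights are illustrative only.
index = {
    "neural": [(1, 2.0), (3, 1.5)],
    "retrieval": [(1, 1.0), (2, 3.0), (3, 0.5)],
}

def rank(query_terms, index, k):
    """Exhaustive additive scoring: RankScore(d) = sum_t w(t, d),
    taking the query-side weight q_t = 1 as in the paper."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, w in index.get(t, []):
            scores[doc_id] += w
    # Return the k highest-scoring documents (stable on ties).
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(rank(["neural", "retrieval"], index, 2))  # → [(1, 3.0), (2, 3.0)]
```

Dynamic pruning methods discussed next aim to produce the same top-k result without scoring every posting exhaustively.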
Threshold-based skipping. During the traversal of posting lists in document retrieval, previous studies have advocated dynamic pruning strategies to skip low-scoring documents that cannot appear in the final top-k list [2,37]. To skip the scoring of a document, a pruning strategy computes the upper bound rank score of a candidate document d, referred to as Bound(d).
If Bound(d) ≤ θ, where θ is the rank score threshold of the current top-k list, this document can be skipped. For example, WAND [2] uses the maximum term weights of the documents of each posting list to determine the rank score upper bound of a pivot document, while BMW [11] and its variants (e.g. [32]) optimize WAND by using block-based maximum weights to compute the score upper bounds. MaxScore [41] uses term partitioning and the top-k threshold to skip unnecessary index visitation and scoring computation. Previous work has also pursued "rank-unsafe" skipping strategies that deliberately over-estimate the current top-k threshold by a factor [2,7,27,39].
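A minimal sketch of this upper-bound test, assuming a precomputed table of per-term maximum weights (names and values here are illustrative):

```python
def upper_bound(doc_terms, max_weight):
    """WAND-style score upper bound: sum the per-term maximum
    weights of the query terms the document contains."""
    return sum(max_weight[t] for t in doc_terms)

def can_skip(doc_terms, max_weight, theta):
    # If even this optimistic bound cannot beat the current
    # top-k threshold theta, full scoring of d is unnecessary.
    return upper_bound(doc_terms, max_weight) <= theta

max_weight = {"neural": 2.0, "retrieval": 3.0}   # illustrative values
print(can_skip(["neural"], max_weight, theta=2.5))  # True (2.0 <= 2.5)
```

The methods below differ mainly in how tightly they can compute such bounds and how cheaply they can locate the next candidate document.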
Learned sparse representations. Earlier sparse representation studies were conducted in [43], DeepCT [9], and SparTerm [1]. Recent work on this subject includes SPLADE [12-14], which learns token importance for document expansion with sparsity control. DeepImpact [30] learns neural term weights on documents expanded by DocT5Query [5]. Similarly, uniCOIL [24] extends the work of COIL [15] for contextualized term weights. Document retrieval with term weights learned from a transformer has been found to be slow in [29,31]. Mallia et al. [31] state that the MaxScore retrieval algorithm does not efficiently exploit the DeepImpact scores. Mackenzie et al. [29] view the learned sparse term weights as "wacky" since they hinder document skipping during retrieval, and thus advocate ranking approximation with score-at-a-time traversal.
Our scheme uses a hybrid combination of BM25 and learned term weights, motivated by previous work on composing lexical and neural ranking [16,22,25,26,42]. GTI adopts such a combination for final ranking. A key difference in our work is that hybrid scoring is used for two-level pruning control, and its formula can differ from the final ranking. This multi-level hybrid scoring difference provides an opportunity for additional pruning and its quality control. Thus the outcome of 2GTI is not a simple linear ranking combination of BM25 and learned weights; two-level guided pruning yields a nonlinear ensemble effect that improves time efficiency while retaining relevance. Our evaluation will include a relevance and efficiency comparison with MaxScore using a simple linear combination.
This paper mainly focuses on MaxScore because it has been shown to be more effective for relatively long queries [34]. We also consider VBMW [32] because it is generally acknowledged to represent the state of the art [29] for many cases, especially when k is small and the query length is short [34].
Figure 1 shows the performance of the original MaxScore retrieval algorithm without BM25 guidance, GTI, and the proposed 2GTI scheme in terms of MRR@10 and recall@k when varying top k in searching MS MARCO passages on the Dev query set. Here k is the targeted number of top documents to retrieve, sometimes called the retrieval depth in the literature. Section 5 has more detailed dataset and index information. For both SPLADE++ and uniCOIL, we build the BM25 model following [31]: we expand passages first using DocT5Query, then use BERT's WordPiece tokenizer to tokenize the text, aligning the token choices of BM25 with these learned models. From Figure 1, there are significant recall and MRR drops with GTI when k varies from 1,000 to 10. Two reasons contribute to these relevance drops.
(1) When the number of top documents k is relatively small, the relevance drops significantly. As k becomes small, the dynamically-updated top-k score threshold moves closer to the maximum rank score of the best document. Fewer documents fall into the top-k positions and more documents below the updated top-k score threshold are removed earlier, so the accuracy of skipping becomes more sensitive. The discrepancy between BM25 scoring and learned weight scoring can cause good candidates to be removed inappropriately by BM25-guided pruning, which can lead to a significant relevance drop for small k.
(2) The relevance drop for SPLADE++ with BM25-guided pruning is noticeably more significant than for uniCOIL. That can be related to the fact that SPLADE++ expands the tokens of each document differently and much more aggressively than uniCOIL. As a result, 98.6% of the term-document pairs in the SPLADE++ index do not exist in the BM25 index even after DocT5Query document expansion, while this number is 1.4% for uniCOIL. Thus, BM25 guidance can become less accurate and improperly skip more good documents.
With the above considerations, our objective is to control the influence of BM25 weights in a constrained manner to safeguard relevance prudently, and to develop better weight alignment when the BM25 index is not well aligned with the learned sparse index. In Figure 1, the recall@k numbers of 2GTI, marked with blue squares, are similar to those of the original method without BM25 guidance. Their MRR@10 numbers overlap with each other, forming nearly flat lines, which indicates that their MRR@10 numbers remain similar even as k decreases. The following two sections present our solutions to the above two issues, respectively.

TWO-LEVEL GUIDED TRAVERSAL

Two-level guidance for MaxScore
We assume the posting list of each term is sorted in increasing order of the document IDs in the list. The MaxScore algorithm [41] can be viewed as conducting a sequence of traversal steps; at each step, it conducts term partitioning and then examines whether the scoring of a selected document should be skipped. We differentiate pruning-oriented actions at two levels as follows.
• Global level. MaxScore uses the maximum scores (upper bounds) of each term and the currently known top-k threshold to partition terms into two lists at each index traversal step: the essential list and the non-essential list. Documents that do not contain any essential term cannot appear in the top-k results and thus can be eliminated. In the next step of index traversal, it starts with the minimum unvisited document ID taken only from the posting lists of essential terms. Thus index visitation is driven by moving this minimum document ID pointer along the essential list.
We consider this level of pruning as global because it guides the skipping of multiple documents and exploits the inter-document relationship implied by maximum term weights. Figure 2(a) depicts an example of the global pruning flow in MaxScore with 4 terms, where each posting list maintains a pointer to the current document being visited at a traversal step. The term partitioning identifies two essential terms t_3 and t_4. The minimum document ID among the current document pointers of these essential terms is d_3, and any document ID smaller than d_3 is skipped from further consideration during this traversal step. The current visitation pointer of each non-essential posting list also moves to the smallest document ID equal to or bigger than d_3.
• Local level. Once a document is selected for possible full evaluation, the rank score upper bound of this document can be estimated and gradually tightened using the maximum weight contribution or the actual weight of each query term for this document. This incrementally refined score upper bound is compared against the dynamically updated top-k threshold, which provides another opportunity to fully or partially skip the evaluation of this document. We differentiate this level of skipping decision as local because the pruning is localized to the specific document selected. Figure 2(b) illustrates an example of local pruning in MaxScore. d_3 is the document selected after term partitioning, and the maximum or actual weights contributed from all posting lists for document d_3 are utilized for the local pruning decision.
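As an illustration, the global (term-partitioning) step can be sketched as follows. This is a simplified sketch rather than the actual implementation, and it assumes the per-term maximum scores are pre-sorted in ascending order:

```python
def partition_terms(max_scores, theta):
    """MaxScore term partitioning (sketch): with terms sorted by
    ascending maximum score, the longest prefix whose summed bounds
    stay below the top-k threshold theta is non-essential, because
    documents containing only those terms cannot reach theta."""
    prefix, p = 0.0, 0
    for i, s in enumerate(max_scores):
        if prefix + s >= theta:
            break
        prefix += s
        p = i + 1
    return p  # terms [0, p) are non-essential, [p, m) are essential

# With theta = 4.0: 1.0 + 2.0 = 3.0 < 4.0, but adding 3.5 reaches it,
# so the first two terms are non-essential.
print(partition_terms([1.0, 2.0, 3.5], theta=4.0))  # → 2
```

As theta grows during traversal, more terms become non-essential and more documents can be skipped wholesale.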
Instead of directly using BM25 to guide pruning at the global and local levels, we propose to use a linear combination of BM25 weights and learned weights to guide skipping at each level as follows, which allows a parameterizable control of their influence.
• We incrementally maintain three accumulated scores for each document d: GlobalScore(d), LocalScore(d), and RankScore(d). GlobalScore(d) is for global pruning, LocalScore(d) is for local pruning, and RankScore(d) is for final ranking:

GlobalScore(d) = α · S_B(d) + (1 − α) · S_L(d),
LocalScore(d) = β · S_B(d) + (1 − β) · S_L(d),
RankScore(d) = γ · S_B(d) + (1 − γ) · S_L(d),

where 0 ≤ α, β, γ ≤ 1, S_B(d) follows Expression 1 using BM25 weights, and S_L(d) follows Expression 1 using learned weights. The RankScore formula follows the GTI setting in [31], and 2GTI with α = β = 1 behaves like GTI. 2GTI with α = β = γ is the same as MaxScore retrieval with the fused rank score, and it uses learned neural weights only when γ = 0.
• With the above three scores for each evaluated document, we maintain three separate queues Q_g, Q_l, and Q_f for documents with the k largest scores in terms of GlobalScore(d), LocalScore(d), and RankScore(d), respectively. The lowest-scoring document in each queue is removed separately without inter-queue coordination. These queues are maintained for different purposes: the first two queues regulate global and local pruning while the last queue produces the final top-k results. When a document is eliminated from further consideration based on local pruning, it is not added to the global and local queues Q_g and Q_l.
But this document may have some partial score accumulated for its RankScore(d), and it is still added to Q_f in case the document with this partial score qualifies for the top-k results based on the latest RankScore(d) value. These three queues yield three dynamic top-k thresholds θ_g, θ_l, and θ_f. They can be used in a pruning decision to avoid any further scoring effort to obtain or refine RankScore(d).
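A minimal sketch of the three fused scores and their top-k queues follows, with hypothetical coefficient values; the real algorithm maintains these incrementally during posting-list traversal rather than over fully scored documents as shown here.

```python
import heapq

def fused(sb, sl, coef):
    # Linear fusion of a BM25 score sb and a learned score sl.
    return coef * sb + (1 - coef) * sl

class TopK:
    """Min-heap of the k largest scores; threshold() is the dynamic
    top-k cutoff (theta) used in pruning decisions."""
    def __init__(self, k):
        self.k, self.heap = k, []
    def push(self, score, doc):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, doc))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc))
    def threshold(self):
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

# Hypothetical coefficients: alpha/beta steer pruning, gamma steers ranking.
alpha, beta, gamma = 1.0, 0.5, 0.1
Qg, Ql, Qf = TopK(2), TopK(2), TopK(2)
for doc, (sb, sl) in {1: (3.0, 2.0), 2: (1.0, 4.0), 3: (2.0, 2.0)}.items():
    Qg.push(fused(sb, sl, alpha), doc)
    Ql.push(fused(sb, sl, beta), doc)
    Qf.push(fused(sb, sl, gamma), doc)
print(Qg.threshold(), round(Qf.threshold(), 3))  # → 2.0 2.1
```

Note how the two queues disagree: with these toy scores, the BM25-driven global threshold and the learned-weight-driven ranking threshold come from different documents, which is exactly the gap the two-level control exploits.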
Revised MaxScore pruning control flow: Figure 2(c) illustrates the extra control flow added for the revised MaxScore algorithm. Let m be the number of query terms. We define:
• Given m posting lists corresponding to m query terms, each i-th posting list contains a sequence of posting records, and each record contains a document ID d, its BM25 weight w_B(t_i, d), and its learned weight w_L(t_i, d). Posting records are sorted in increasing order of their document IDs.
• An array σ_L of size m where σ_L[i] is the maximum contribution of the learned weight to any document for the i-th term.
• An array σ_B of size m where σ_B[i] is the maximum contribution of the BM25 weight to any document for the i-th term.
Global pruning with term partitioning. We find the largest integer p with 1 ≤ p ≤ m so that

Σ_{i=1}^{p−1} (α · σ_B[i] + (1 − α) · σ_L[i]) < θ_g.

All terms from p to m are considered essential. If a document d does not contain any essential term, the upper bound of its global score satisfies

GlobalScore(d) ≤ Σ_{i=1}^{p−1} (α · σ_B[i] + (1 − α) · σ_L[i]) < θ_g.

This document cannot appear in the final top-k list based on the global score. Then this document is skipped without appearing in any of the three queues.

Once the essential term list from the p-th position is determined, let the next minimum document ID among the current position pointers in the posting lists of all essential terms be document d. We also call it the pivot document.
Local pruning. Next we check whether the detailed scoring of the selected pivot document d can be avoided fully or partially. Following an implementation in [40], we describe this procedure with a modification to use hybrid scoring as follows; it repeats the following steps with the initial value of the term position j set to the pivot position, and j decreases by 1 at each loop iteration.
• Let PartialScore_j(d) be the sum of all term weights of document d in the posting lists from position j to m after linear combination. Namely PartialScore_j(d) = Σ_{i=j}^{m} (β · w_B(t_i, d) + (1 − β) · w_L(t_i, d)) when the i-th posting list contains d, with the i-th contribution being 0 otherwise. As j decreases, the term weight of pivot document d is extracted from the posting list of the j-th term if available.
• Let Bound_j(d) be the upper bound for the partial local score of document d in the posting lists of the first to j-th query terms, namely Bound_j(d) = Σ_{i=1}^{j} (β · σ_B[i] + (1 − β) · σ_L[i]).
• At any time during the above calculation, if Bound_{j−1}(d) + PartialScore_j(d) ≤ θ_l, further rank scoring for pivot document d is skipped and this document will not appear in any of the three queues. Figure 2(b) depicts that the partial bound and partial score of LocalScore(d_3) for pivot document d_3 are computed to assist a pruning decision.
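The loop above can be sketched as follows. Here doc_weights and max_scores are assumed to already hold the β-fused per-term values, and the sketch simplifies away the incremental extraction of weights from posting lists:

```python
def local_prune(doc_weights, max_scores, theta_l):
    """Sketch of incremental local pruning: doc_weights[i] is the fused
    weight of the pivot document for term i (0 if absent), max_scores[i]
    the fused per-term upper bound. Returns the document's score, or
    None as soon as the refined bound drops to/below theta_l."""
    m = len(doc_weights)
    partial = 0.0                      # fused weights accumulated so far
    bound = sum(max_scores)            # start from the loosest bound
    for j in range(m - 1, -1, -1):     # walk from the last term down
        bound -= max_scores[j]         # replace term j's bound ...
        partial += doc_weights[j]      # ... with its actual weight
        if bound + partial <= theta_l:
            return None                # skip: cannot beat the threshold
    return partial

print(local_prune([0.5, 0.0, 1.0], [2.0, 1.5, 1.0], theta_l=3.0))  # → None
```

Each iteration only tightens the bound, so the skip test can fire early without touching the remaining posting lists.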
Complexity. 2GTI's complexity is the same as that of MaxScore and GTI. The in-memory space cost includes the space to host the inverted index involved for the query and the three queues. The time complexity is proportional to the total number of posting records involved for a query, multiplied by log k for queue updating.
A posting list may be divided and compressed in a block-wise manner, and Block-Max MaxScore could use 2GT similarly, although a previous study [34] shows Block-Max MaxScore is actually slower than MaxScore under several compression schemes. We will discuss the use of 2GT in block-based BMW in Appendix B.

Relevance properties of 2GTI
2GTI ensembles BM25 and learned weights for pruning in addition to rank score composition, producing a top-k ranked list which can differ from additive ranking with learned weights alone or from their linear combination with BM25 weights. Thus 2GTI is not rank-safe with respect to any of these baselines. Two-level pruning is driven by the different combination coefficients α, β, and γ configured in 2GTI, and the gap among their values provides an opportunity for additional pruning while 2GTI tries to retain relevance effectiveness. Is there a relevance guarantee 2GTI can offer in case such pruning sometimes skips relevant documents erroneously? To address this question analytically, this subsection presents three properties regarding the relevance outcome and competitiveness of 2GTI-based retrieval.
Our analysis will use the following terms. Given query Q, let A_λ be the ranked list of all documents of the given dataset sorted in descending order of their rank scores based on a linear combination of their BM25 weights and learned weights with coefficient λ, namely Σ_{t∈Q} λ · w_B(t, d) + (1 − λ) · w_L(t, d) for document d. Specifically, there are three ranked lists: A_α, A_β, and A_γ. 2GTI maintains 3 queues Q_g, Q_l, and Q_f with 3 dynamically updated top-k thresholds θ_g, θ_l, and θ_f. Let Θ_g, Θ_l, and Θ_f be the final top-k thresholds of these 3 queues at the end of 2GTI; namely, each is the rank score of the k-th document in the corresponding queue. The following fact is true:

Proposition 1. Assume the subset of top-k documents in each of A_α, A_β, and A_γ is unique after arbitrarily swapping rank positions of documents with the same score. Then any document that appears in the top-k positions of all of A_α, A_β, and A_γ is in the top-k outcome of 2GTI.
Proof. Consider any document d that appears in the top-k positions of all three ranked lists. If document d were eliminated by global pruning during 2GTI retrieval, then GlobalScore(d) = Θ_g, and the α-based rank score of both document d and the (k+1)-th document in ranked list A_α would have to be Θ_g. Then the subset of top-k documents in A_α would not be unique after arbitrarily swapping rank positions of documents with the same score, which is a contradiction.

For the same reason, we can argue that document d cannot be eliminated by local pruning or rejected by θ_f when being added to Q_f during 2GTI retrieval. Then this document has to appear in the final outcome of 2GTI. □

The following two propositions analyze when 2GTI performs better in relevance than a two-stage search algorithm, called TwoS_{α,γ}, which fetches the top-k results from list A_α and then re-ranks them using the scoring formula of A_γ.

Proposition 2. Assume the subset of top-k documents in each of A_α, A_β, and A_γ is unique after arbitrarily swapping rank positions of documents with the same score. If 2GTI is configured with β = α or β = γ, the average γ-based rank score of the top-k documents produced by 2GTI is no less than that of the two-stage algorithm TwoS_{α,γ}.
Proof. Let TwoS[k] denote the top-k document subset in the outcome of TwoS_{α,γ}. To prove this proposition, we compare the average γ-based rank score of the documents in TwoS[k] with that of Q_f at the end of 2GTI. Notice that any document d ∈ TwoS[k] is in the top-k results of ranked list A_α, and this top-k subset is deterministic based on the assumption of this proposition. Then d cannot be eliminated by global pruning in 2GTI.

Given any document d with d ∈ TwoS[k] and d ∉ Q_f at the end of 2GTI, it is either eliminated by local pruning with threshold θ_l or by the top-k thresholding of queue Q_f with threshold θ_f. In the latter case, RankScore(d) ≤ θ_f ≤ Θ_f. When d is eliminated by local pruning, local pruning has to use a different formula from global pruning because d is not eliminated by global pruning, and then 2GTI has to be configured with β = γ instead of β = α. In that case local pruning is identical to elimination with the top-k threshold of Q_f. Then RankScore(d) ≤ θ_l ≤ Θ_f.
Since the size of both TwoS[k] and Q_f is k, the average γ-based rank score of the documents in Q_f is no less than that of TwoS[k]. □

Definition 1. For a dataset in which documents are only labeled relevant or irrelevant for any test query, we say ranked list A_λ1 outmatches A_λ2 if, whenever A_λ2 orders a pair of relevant and irrelevant documents correctly for a query, A_λ1 also orders them correctly.

Proposition 3. Assume documents in a dataset are only labeled as relevant or irrelevant for a test query. Given a query, when A_γ outmatches A_β, which outmatches A_α, 2GTI retrieves an equal or greater number of relevant documents in top-k positions than the two-stage algorithm TwoS_{α,γ}.
Proof. When 2GTI completes its retrieval for a query, we count the number of relevant documents in the top-k positions of list A_α, queue Q_l, and queue Q_f as N_α, N_l, and N_f, respectively. To show N_α ≤ N_l, we initialize these counters as 0 and run the following loop to compute N_α and N_l iteratively. The loop index variable i varies from k, k − 1, down to 1, and at each iteration we look at document x at position i of A_α and document y at position i of Q_l. Let l_x and l_y be their binary labels, where value 1 means relevant and 0 means irrelevant.

• The above process repeats and moves to a higher position until i = 1. When i = 1, with the top-1 document x in A_α and the top-1 document y in Q_l, the only possible cases are l_x = l_y, or l_x = 0 and l_y = 1. Therefore, at the end of the above process, N_l ≥ N_α. Similarly, we can verify that N_f ≥ N_l since A_γ outmatches A_β. Therefore N_f ≥ N_l ≥ N_α. The number of relevant documents up to position k retrieved by 2GTI is N_f, while the number of relevant documents up to position k retrieved by TwoS_{α,γ} is N_α. Thus this proposition is true. □

The above analysis indicates that the top documents agreed upon by the three rankings A_α, A_β, and A_γ are always kept at the top by 2GTI, and a properly configured 2GTI algorithm can outperform a two-stage retrieval and re-ranking algorithm in relevance, especially when ranking A_γ outmatches A_β and A_β outmatches A_α for a query. Since two-stage search with neural re-ranking conducted after BM25 retrieval is widely adopted in the literature, this analysis provides useful insight into the "worst-case" relevance competitiveness of 2GTI with two-level pruning. GTI can be considered a special case of 2GTI with α = β = 1 when the same index is used, and the above three propositions are true for GTI. 2GTI provides more flexibility in pruning with quality control than GTI, and Section 5 further evaluates their relevance difference.

Alignment of tokens and weights
The BM25 model is usually built with word-level tokenization on the original or expanded document set, and the popular expansion method uses DocT5Query with the same tokenization method. When a learned representation uses a different tokenization method, such as BERT's WordPiece based on subwords from the BERT vocabulary, we need to align it with BM25 for a consistent term reference. For example, when using BM25 to guide the traversal of a SPLADE index, the WordPiece tokenizer is applied to a document expanded with DocT5Query before BM25 weighting is applied to each token. Once tokens are aligned, from the index point of view, the same token has two different posting lists: one based on BM25 weights and one based on SPLADE. To merge them when postings do not align one-to-one, the missing weight is set to zero, as proposed in [31]. We call this zero-filling alignment. As alternatives, we propose two more methods that fill missing weights with better weight smoothness.
• One-filling alignment. We assign 1 as the term frequency for a token missing in the BM25 model while this token appears in the learned token list of a document. The justification is that a zero weight is too abrupt when such a term is considered useful for a document by a learned neural model. Having term frequency one means that this token is present in the document, even with the lowest possible count.
• Scaled alignment. This alternative replaces the missing weights in the BM25 model with a scaled learned score, using the ratio of the mean values of the non-zero weights in the two models.
For a document with ID d that contains term t, let its BM25 weight be w_B(t, d) and its learned weight be w_L(t, d). Let w*_B(t, d) be the adjusted BM25 weight. Set S_B contains all posting records with non-zero BM25 weights, and set S_L contains all posting records with non-zero learned weights. Then w*_B(t, d) is defined as:

w*_B(t, d) = w_B(t, d), if w_B(t, d) ≠ 0;
w*_B(t, d) = w_L(t, d) · ( Σ_{(t',d')∈S_B} w_B(t', d') / |S_B| ) / ( Σ_{(t',d')∈S_L} w_L(t', d') / |S_L| ), if w_B(t, d) = 0.

EVALUATIONS

Datasets and settings. Our evaluation uses the MS MARCO document and passage collections [3,8], and 13 publicly available BEIR datasets [38]. The results for the BEIR datasets are described in Appendix A. For MS MARCO, the contents of the document collection are segmented during indexing and re-grouped after retrieval using the "max-passage" strategy, following [23]. There are 8.8M passages with an average length of 55 words, and 3.2M documents with an average length of 1131 words before segmentation. The Dev query sets for passage and document ranking have 6980 and 5193 queries, respectively, with about one judgment label per query. The passage/document ranking tasks of the TREC Deep Learning (DL) 2019 and 2020 tracks provide 43 and 54 queries, respectively, with many judgment labels per query.
In producing an inverted index, all words are lower-cased. Following GT, we pack the learned score and the term frequency into the same integer. For DeepImpact, we adopt GT's index directly. The BM25-T5 index is dumped from the DeepImpact index. Both BM25-T5 and DeepImpact use natural word tokenization.
SPLADE and uniCOIL use BERT's WordPiece tokenizer. In order to align with them, the BM25-T5-B index reported in the following tables uses the same tokenizer as well. The impact scores of uniCOIL are obtained from Pyserini [23]. For SPLADE, in order to achieve the best performance, we retrained the model following the setup of SPLADE++ [13]. We start from the pretrained coCondenser model [6] and distill using the sentenceBERT hard negatives from a cross-encoder teacher [35] with MarginMSE loss. For FLOP regularization, we use 0.01 and 0.008 for queries and documents, respectively. We construct the inverted indexes, convert them to the PISA format, and compress them using SIMD-BP128 [21] following [31,34].
Table 1 shows the dataset and index characteristics of the different weighting models on the MS MARCO Dev dataset. Following [29], we assume that a query can be pre-processed with a "pseudo-document" trick that assigns custom weights to query terms in uniCOIL and SPLADE. Therefore, there may be token repetition in each query to reflect token weighting. Column 1 is the mean query length in tokens without or with counting duplicates. Column 3 is the inverted index size, while the last column is the size after merging BM25 and learned weights in the index.
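As an illustration of such token repetition, a real-valued query weight can be quantized into an integer repetition count. The scale factor and rounding below are hypothetical, not the exact quantization used by Pyserini:

```python
def quantize_query(term_weights, scale=10):
    """Turn real-valued query term weights into token repetitions,
    so an impact-style engine can treat repetition count as weight.
    The scale factor here is purely illustrative."""
    tokens = []
    for term, w in term_weights.items():
        reps = max(1, round(w * scale))  # keep at least one occurrence
        tokens.extend([term] * reps)
    return tokens

q = quantize_query({"neural": 0.31, "ranking": 0.12})
print(q.count("neural"), q.count("ranking"))  # → 3 1
```

This is why the mean query length in Table 1 differs with and without counting duplicates.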
The C++ implementation of 2GTI with the modified MaxScore and VBMW algorithms is embedded in PISA [33], and the code will be released at https://github.com/Qiaoyf96/2GTI. Our evaluation using this implementation runs as a single thread on a Linux server with an Intel i5-8259U 2.3GHz CPU and 32GB of memory. Weights are chosen by sampling queries from the MS MARCO training dataset.
Metrics. For the MS MARCO Dev sets, we report relevance in terms of mean reciprocal rank (MRR@10 on passages and MRR@100 on documents), following the official leaderboard standard. We also report recall@k, which is the percentage of relevant-labeled results that appear in the final top-k results. For the TREC DL test sets, we report normalized discounted cumulative gain (nDCG@10) [17]. This reporting follows the common practice of previous work (e.g. [12,15,16,30]).
Before timing queries, all compressed posting lists and metadata for the tested queries are pre-loaded into memory, following the same assumption in [19,32]. Retrieval mean response times (MRT) are reported in milliseconds. The 99th percentile time (P99) is reported within parentheses in the tables below, corresponding to the tail latency denoted in [28].
Statistical significance. For the reported numbers on the MS MARCO passage and document Dev sets in the rest of this section, we have performed a pairwise t-test on the relevance difference between 2GTI and the GTI baseline, and between 2GTI and the original learned sparse retrieval without BM25 guidance. No statistically significant degradation has been observed at the 95% confidence level. We have also performed a pairwise t-test comparing the reported relevance numbers of 2GTI and GTI, and we mark '†' in the evaluation tables when there is a statistically significant improvement of 2GTI over GTI at the 95% confidence level. We do not perform a t-test on the DL'19 and DL'20 query sets because the number of queries in these sets is small.

[Table 2 configurations — UniCOIL, Documents: α = 1; 2GTI-Accurate: β = 0; 2GTI-Fast: β = 0.5 (k = 10), 1 (k = 1000); GTI and 2GTI: γ = 0.1. DeepImpact, Passages: α = 1; 2GTI-Accurate: β = 0; 2GTI-Fast: β = 0.5 (k = 10), 1 (k = 1000); GTI and 2GTI: γ = 0.5.]

Overall results with MS MARCO. Table 2 lists a comparison of 2GTI with the baselines using three sparse representations for retrieval on the MS MARCO and TREC DL datasets. 2GTI uses scaled-filling alignment by default, while GTI uses zero filling as specified in [31]. The γ value is chosen to be the same for GTI and 2GTI for each representation, which is the best choice in most cases. The "accurate" configuration denotes the one that reaches the highest relevance score. The "fast" configuration denotes the one that reaches a relevance score within 1% of the accurate configuration while being much faster.
2GTI vs. GTI in SPLADE++. Table 2 shows that 2GTI with default scaled filling significantly outperforms GTI with default zero filling for SPLADE++, where the BM25 index is not well aligned. "SPLADE++-Org" denotes the original MaxScore retrieval performance using the SPLADE++ model trained by ourselves, and its MRR@10 number is higher than what has been reported in [13]. When k = 1,000, GT is slightly better than GTI, and with the fast configuration, MRR@10 of 2GTI is 32.4% higher than that of GT while 2GTI is 7.8x faster than GT for the Dev set. A significant increase in nDCG@10 and decrease in MRT are also observed on DL'19 and DL'20. When k = 10, there is also a large relevance increase and time reduction from GTI or GT to 2GTI for all three test sets. For example, the relevance is 46.4% or 44.6% higher and the mean latency is 5.2x or 5.3x faster for the Dev set.
Compared to the original MaxScore method, 2GTI has about the same relevance score for both k = 10 and k = 1,000 while having a much smaller latency. For example, with the 2GTI-Fast configuration, there is a 6.5x reduction (278ms vs. 43.1ms) for the Dev passage set when k = 1,000 and a 5.3x reduction (121ms vs. 22.7ms) when k = 10.
2GTI vs. GTI in DeepImpact and uniCOIL. As shown in Table 2, GTI (or GT) performs very well for k = 1,000 in both DeepImpact and uniCOIL in speeding up retrieval while maintaining relevance similar to the original retrieval. The two-level differentiation for dynamic index pruning does not further improve relevance or shorten retrieval time here. This can be explained by the fact that the BM25-T5 index is well aligned with both the DeepImpact index and the uniCOIL index. For the same reason, weight filling to address index alignment is not needed and brings no improvement in these two cases.
When k decreases from 1,000 to 10, as shown in Figure 1 and discussed in Section 3, the recall ratio starts to drop and relevance effectiveness degrades. When k = 10, as shown in Table 2, DeepImpact-2GTI-Fast increases MRR@10 from 0.3375 by GTI to 0.3395 for the Dev set and delivers slightly higher MRR@10 or nDCG@10 scores than GTI on the DL'19 and DL'20 sets. For uniCOIL, 2GTI-Fast increases MRR@10 from 0.3384 by GTI to 0.3548 for the Dev set and increases nDCG@10 from 0.6959 to 0.7135 for DL'19. There is also a modest relevance increase for DL'20 passages with k = 10, and a similar trend is observed for the document retrieval task. The price paid by 2GTI is an increase in retrieval latency, while its latency is still much smaller than the original retrieval time.
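The two-level pruning decisions discussed in this section can be sketched as follows. This is a simplified stand-in, assuming a linear blend of BM25 and learned-weight score bounds with control knobs `alpha` (global level) and `beta` (local level); the paper's exact control formula may differ, and all names here are illustrative.

```python
def global_prune(bm25_bound, learned_bound, theta, alpha):
    """Global-level skipping decision for a document or block.
    alpha=1 relies fully on BM25 guidance (GTI-like); alpha=0 ignores BM25."""
    blended = alpha * bm25_bound + (1.0 - alpha) * learned_bound
    return blended <= theta  # True -> safe to skip under this heuristic

def local_prune(bm25_bound, learned_bound, theta, beta):
    """Local-level decision with its own, typically smaller, BM25 influence beta."""
    blended = beta * bm25_bound + (1.0 - beta) * learned_bound
    return blended <= theta
```

Decoupling `alpha` and `beta` lets the global level keep aggressive BM25-driven skipping (for speed) while the local level trusts the learned weights more (for relevance), which is the intuition behind the accurate/fast configurations compared above.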
Design options with weight alignment and threshold over-estimation. Table 3 examines the impact of weight alignment and a design alternative based on threshold over-estimation for the MS MARCO passage Dev set using SPLADE++ when k = 10. In the top portion of this table, threshold over-estimation by a factor of 1.1, 1.3, or 1.5 is used in the original retrieval algorithm without BM25 guidance, and these factor choices are similar to the ones in [7,27,39]. That essentially sets α = 0, β = 0, and η = 0 while multiplying the global and local skipping thresholds by the above factor in 2GTI. The result shows that even though threshold over-estimation can reduce retrieval time, the relevance reduction is significant, meaning that the aggressive threshold causes incorrect dropping of some desired documents.
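Threshold over-estimation as compared above amounts to inflating the current top-k threshold before the pruning test; a minimal sketch (function name and arguments are illustrative):

```python
def overestimated_prune(score_upper_bound, theta, factor=1.3):
    """Prune if the score upper bound cannot beat an inflated top-k threshold.
    factor > 1 prunes more aggressively, risking the loss of true top-k documents."""
    return score_upper_bound <= factor * theta

# A document with bound 2.5 survives the true threshold 2.0,
# but is dropped once the threshold is inflated by 1.3x (2.5 <= 2.6).
```

This illustrates why over-estimation trades relevance for speed: documents whose bounds fall between `theta` and `factor * theta` are skipped even though they could enter the final top-k list.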
The second portion of Table 3 examines the impact of the different weight filling methods described in Section 4.3 for alignment when they are applied to GTI and 2GTI, respectively. In both cases, scaled filling marked as "/s" is the most effective, while one-filling marked as "/1" also outperforms zero-filling marked as "/0". The MRT of 2GTI/s becomes 10.5x smaller than that of 2GTI/0 while there is no negative impact on its MRR@10. The MRT of GTI/s is about 13.0x smaller than that of GTI/0 while there is a large MRR@10 increase.

A validation of 2GTI's properties. To corroborate the competitiveness analysis in Section 4.2, Table 4 gives MRR@10 scores and retrieval times in milliseconds of the algorithms with different configurations on the Dev set of MS MARCO passages with k = 10 and SPLADE++ weights. The result shows that the listed configurations of 2GTI have a higher MRR@10 number than the two-stage search baseline and than 2GTI with α = β = 1, which behaves as GTI. MRR@10 of ranking with a simple linear combination of BM25 and learned weights is only slightly higher than that of 2GTI, but it is much slower.
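The zero-, one-, and scaled-filling variants compared in Table 3 can be sketched as below. The scaled-filling rule shown here, scaling a term's learned weight by the average BM25-to-learned weight ratio over shared terms, is an assumption for illustration; Section 4.3 defines the actual scheme.

```python
def fill_bm25_weights(bm25_w, learned_w, mode="scaled"):
    """Fill a BM25 weight for terms that appear only in the learned index.
    mode: 'zero' ('/0'), 'one' ('/1'), or 'scaled' ('/s')."""
    shared = [t for t in bm25_w if t in learned_w and learned_w[t] > 0]
    scale = (sum(bm25_w[t] / learned_w[t] for t in shared) / len(shared)) if shared else 1.0
    filled = dict(bm25_w)
    for term, w in learned_w.items():
        if term not in filled:
            if mode == "zero":
                filled[term] = 0.0
            elif mode == "one":
                filled[term] = 1.0
            else:  # scaled
                filled[term] = scale * w
    return filled

# Hypothetical vocab: term "b" exists only in the learned index.
filled = fill_bm25_weights({"a": 2.0}, {"a": 1.0, "b": 3.0}, mode="scaled")
# filled["b"] == 6.0 since the shared-term ratio is 2.0
```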
Sensitivity to weight distribution. We have distorted the SPLADE++ weight distribution in several ways to examine the sensitivity of 2GTI and found that 2GTI is still effective. For example, when we apply a square root function to the neural weight of every token in MS MARCO passages, the relevance score of both the original retrieval and 2GTI drops to 0.356 MRR@10 due to weight distortion, while 2GTI is 5.0x faster than the original MaxScore when k = 10.

Efficient SPLADE model. Table 5 shows the application of 2GTI to a recently published efficient SPLADE model [20] which has made several improvements in retrieval speed. We have used the released checkpoint of this efficient model called BT-SPLADE-L, which has a weaker MRR@10 score but is significantly faster than our trained SPLADE baseline reported in Table 2. When used with this new SPLADE model, the 2GTI/s-Fast version results in a 2.2x retrieval time speedup over MaxScore. Its MRR@10 is higher than that of GTI/s and has less than 1% degradation compared to the original MaxScore.

CONCLUDING REMARKS
The contribution of this paper is a two-level parameterized guidance scheme with index alignment to optimize retrieval traversal with a learned sparse representation. Our formal analysis shows that a properly configured 2GTI algorithm, including GTI, can outperform a two-stage retrieval and re-ranking algorithm in relevance.
Our evaluation shows that the proposed 2GTI scheme can make the BM25 pruning guidance more accurate in retaining relevance. For MaxScore with SPLADE++ on MS MARCO passages, 2GTI can lift relevance by up to 32.4% while being 7.8x faster than GTI when k = 1,000, and by up to 46.4% while being 5.2x faster when k = 10. In all evaluated cases, 2GTI is much faster than the original retrieval without BM25 guidance; for example, up to 6.5x faster than MaxScore on SPLADE++ when k = 1,000. We have also observed similar performance patterns on the BEIR datasets when comparing 2GTI with GTI and the original MaxScore using SPLADE++ learned weights. Compared to other options such as threshold under-estimation to reduce the influence of BM25 weights, the two-level control is more accurate in maintaining strong relevance at a much lower time cost. While our study is mainly centered on MaxScore-based retrieval, 2GTI can be used with VBMW, and our evaluation shows that VBMW-2GTI can be a preferred choice for a class of short queries without stop words when k is small.

A ADDITIONAL EVALUATION RESULTS

In Figure 3, the blue curve connected with squares fixes α = 1 at the global level and varies β from 1 at the left bottom end to 0 at the right top end. Decreasing the β value is in general positive for relevance up to some point, as BM25 influence decreases gradually at the local level; after such a point, the relevance gain becomes much smaller or negative. For example, after β in the blue curve for SPLADE++ reaches 0.3 on the Dev set, its additional decrease no longer lifts MRR@10 visibly while the latency continues to increase, which indicates the relevance benefit has peaked at that point. The red curve connected with dots fixes β = 1 and varies α from 1 at the left end to 0 at the right end. As α decreases from 1 to 0, the latency increases because the BM25 influence diminishes at the global pruning level and fewer documents are skipped. The relevance for this curve is relatively flat in general and lower than that of the blue curve, indicating that the global-level BM25 guidance reduces time significantly while having less impact on relevance. Our experience with the tested datasets is that the parameter setting for 2GTI typically reaches a relevance peak when α is close to 1 and β varies between 0.3 and 1.
Note that even though the above result advocates α close to 1, α and β still take different values to be more effective on the tested data, reflecting the usefulness of two-level pruning control.
Threshold under-estimation. In Figure 3, the brown curve connected with triangles fixes α = β = 1 and under-estimates the skipping threshold by a common factor at the local and global levels; this behaves like GTI coupled with scaled weight filling as a special case of 2GTI. The factor varies from 1 at the left bottom end to 0.7 at the right top end of this brown curve. As the factor decreases, the skipping threshold becomes looser and there is less chance that desired documents are skipped. Retrieval relevance can then improve while retrieval time can increase substantially. Compared with the blue curve that adjusts β, retrieval takes a much longer time along the brown curve to reach the peak relevance, as shown in this figure, and the brown curve is generally placed to the right of the blue curve. For example, on the Dev set with uniCOIL, the brown curve with threshold under-estimation reaches the best relevance at a mean latency of 3.7ms while the blue curve with β adjustment reaches the same peak at a mean latency of 2.3ms, which is 1.6x faster.
Zero-shot performance on the BEIR datasets. We evaluate the zero-shot ranking effectiveness and response time of 2GTI using the 13 search and semantic relatedness datasets from the BEIR collection. Our training of the SPLADE++ model is based only on MS MARCO data without using any BEIR data. Table 6 lists the nDCG@10 scores of the original MaxScore on SPLADE++, 2GTI/s-Fast (α = 1, β = 0.3, η = 0.05), and GTI (α = β = 1, η = 0.05). The retrieval depths are k = 10 and k = 1000. This table also reports the mean response time of retrieval in milliseconds. The SPLADE++ model trained by ourselves has an average nDCG@10 score of 0.500, close to the 0.507 reported in the SPLADE++ paper [13]. The original MaxScore's nDCG@10 score does not change between k = 10 and k = 1000.
When k = 10, 2GTI has almost identical nDCG@10 scores to the original MaxScore while being on average 2.0x faster for these BEIR datasets. When GTI runs on the same index data, its average nDCG@10 score is 0.43, and it is faster than 2GTI, with an average 6.1x speedup over the original MaxScore on these datasets. Two-level pruning in 2GTI preserves relevance better than GTI, and this is consistent with what we have observed for searching MS MARCO passages.
When k = 1000, the guided traversal algorithms have a better chance to retain relevance. 2GTI has a slightly higher average relevance of 0.501 nDCG@10 than with k = 10, and it is about 2.5x faster on average than the original MaxScore. For GTI running on the same index with the same alignment, the average nDCG@10 is 0.496 with an average speedup of 2.7x over MaxScore. Its relevance score is close to that of 2GTI, as BM25-driven pruning under a large k value can still keep a good recall ratio.

Figure 3: Controlling influence of BM25 on pruning

Table 1 :
Model characteristics with MS MARCO Dev set

Table 3 :
Impact of design options on MS MARCO passages

Table 6 :
Zero-shot relevance in nDCG@10 and retrieval latency in milliseconds on BEIR datasets with SPLADE++