Representation Sparsification with Hybrid Thresholding for Fast SPLADE-based Document Retrieval

Learned sparse document representations produced by a transformer-based neural model have been found attractive in both relevance effectiveness and time efficiency. This paper describes a representation sparsification scheme based on hard and soft thresholding with an inverted index approximation for faster SPLADE-based document retrieval. It provides analytical and experimental results on the impact of this learnable hybrid thresholding scheme.


INTRODUCTION
Recently, learned sparse retrieval techniques [5-8, 10, 20, 23, 37] have become attractive because such representations can deliver strong relevance by leveraging transformer-based models to expand document tokens with learned weights, while still taking advantage of traditional inverted index based retrieval techniques [24, 25]. Their query processing is cheaper than that of dense representations, which require GPU support (e.g., [31, 32, 35]) even with efficiency optimizations through approximate nearest neighbor search [14, 34, 38].
This paper focuses on the SPLADE family of sparse representations [6-8] because it delivers a high MRR@10 score for MS MARCO passage ranking [4] and strong zero-shot performance on the BEIR datasets [33], which are well-recognized IR benchmarks. Sparsification in SPLADE has used L1 and FLOPS regularization to minimize non-zero weights during model learning, and our objective is to exploit additional opportunities to further increase the sparsity of inverted indices produced by SPLADE. Earlier static inverted index pruning research [1-3] for lexical models has shown the usefulness of trimming a term posting list or a document by a limit. Yang et al. [36] conduct top token masking by limiting the count of top activated weights uniformly per document and gradually reducing this limit to a targeted constant during training. Motivated by these studies [1-3, 36], and because they have not addressed the learnability of a pruning limit through relevance-driven training, this paper develops a learnable thresholding architecture that filters out unimportant neural weights produced by the SPLADE model through joint training.
The contribution of this paper is a learnable hybrid hard and soft thresholding scheme with an inverted index approximation that increases the sparsity of SPLADE-based document and query feature vectors for faster retrieval. In addition to experimental validation with the MS MARCO and BEIR datasets, we provide an analysis of the impact of hybrid thresholding with joint training on index approximation errors and training update effectiveness.

BACKGROUND
For a query $q$ and a document $d$, after expansion and encoding, they can be represented by vectors $w(q)$ and $w(d)$ of length $|V|$, where $V$ is the vocabulary set. The rank score of $q$ and $d$ is computed as the dot product

$$\mathrm{Score}(q, d) = \sum_{j \in V} w_j(q) \cdot w_j(d).$$

For sparse vectors with many zeros, retrieval can utilize a data structure called an inverted index during online inference for fast score computation [24, 25].

The SPLADE model uses the BERT token space to predict the feature vector $w$. Its latest SPLADE++ model first calculates the importance $s_{i,j}$ of the $i$-th input token of $d$ for each token $j$ in $V$:

$$s_{i,j} = \mathrm{Transform}(h_i)^\top E_j + b_j,$$

where $h_i$ is the BERT embedding of the $i$-th input token and $E_j$ is the BERT input embedding for the $j$-th vocabulary token. Transform() is a linear layer with GeLU activation and LayerNorm. The weights in this linear layer, $E_j$, and $b_j$ are the SPLADE parameters updated during training, and we call them set Θ. Then the $j$-th entry $w_j$ of document $d$ (or a query) is max-pooled as

$$w_j = \max_{i} \log\bigl(1 + \mathrm{ReLU}(s_{i,j})\bigr).$$

The loss function of SPLADE models [6-8] contains a per-query ranking loss $L_{rank}$ and sparsity regularization. The ranking loss has evolved from a log-likelihood based function that maximizes positive document probability to margin MSE for knowledge distillation. This paper uses the SPLADE loss with the combination that delivers the best result in our training process: $L_{rank}$ is the ranking loss with margin MSE for knowledge distillation [12]. Other learned sparse retrieval models include those of [37], DeepCT [5], DeepImpact [23], and uniCOIL [10, 20].

The sparsity of a neural network has been studied in the deep learning community. Soft thresholding in [16] adopts a learnable threshold $b$ with function $S(w, b) = \mathrm{ReLU}(w - b)$ to make parameter $w$ zero under threshold $b$. A hard thresholding function is $H(w, b) = w$ when $w \geq b$ and 0 otherwise. Approximate hard thresholding [28] uses a Gauss error function to approximate $H(w, b)$ with smooth gradients. Dynamic sparse training [21] finds a dynamic threshold with masked layers. These works, including recent ones [9], target the sparsification of parameter edges in a deep neural network. In our context, a token weight $w_j$ is an output node of a network. The sparsification of output nodes is addressed in activation map compression [11] using ReLU as soft thresholding together with L1 regularization. The work of [15] further boosts sparsity with the Hoyer regularization and a variant of ReLU. These techniques have not been investigated in the context of sparse retrieval, and the impact of thresholding on relevance and on query processing time with inverted indices requires new design considerations and model structuring for document retrieval, even though the previous work can be leveraged.

Design considerations. To zero out a token weight below a learnable threshold, there are two options: soft thresholding [16] and approximate hard thresholding [28]. For query token weights, we find that soft thresholding does not affect relevance significantly. For document token weights, our study finds that hard thresholding retains relevance better than soft thresholding, since it leaves token weights unchanged when they exceed the threshold. Because the subgradient of hard thresholding with respect to a threshold is always 0, an approximation needs to be carried out for training. For search index generation, an inverted index produced with the same approximate hard thresholding as training keeps many unnecessary non-zero document token weights, slowing down retrieval significantly. Thus we directly apply hard thresholding with the threshold learned from training, as shown in Figure 1. There is a gap between trained document token weights and the actual weights used in our inverted index generation and online inference, and we intend to minimize this gap (called an index approximation error).
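To make the scoring pipeline above concrete, the following minimal Python sketch (our illustration, not the authors' code) computes max-pooled SPLADE-style token weights from per-position importance scores and ranks a query-document pair by a sparse dot product; the array shapes and values are hypothetical. In a real system, the non-zero entries of each document vector become postings in an inverted index.

```python
import numpy as np

def splade_weights(token_logits: np.ndarray) -> np.ndarray:
    """Max-pool per-position importance scores into one |V|-sized vector.

    token_logits: (num_input_tokens, |V|) array of scores s_{i,j};
    returns w with w_j = max_i log(1 + ReLU(s_{i,j})).
    """
    return np.log1p(np.maximum(token_logits, 0.0)).max(axis=0)

def rank_score(w_q: np.ndarray, w_d: np.ndarray) -> float:
    """Dot product over the (mostly zero) vocabulary dimension."""
    return float(w_q @ w_d)

# Toy example with |V| = 6 and 3 input tokens per text.
rng = np.random.default_rng(0)
w_d = splade_weights(rng.normal(size=(3, 6)))
w_q = splade_weights(rng.normal(size=(3, 6)))
print(rank_score(w_q, w_d))
```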

HYBRID THRESHOLDING (HT)
Our design therefore takes a hybrid approach: it applies soft thresholding to query token weights during training and inference, applies approximate hard thresholding to document token weights during training, and uses hard thresholding for documents during index generation. For approximate hard thresholding, we propose a logistic sigmoid-based function instead of a Gauss error function [28]. This sigmoid thresholding simplifies our analysis of how its hyperparameter choice affects index approximation errors and training stability.

Trainable and approximate thresholding
Training computes threshold parameters $b_d$ and $b_q$ for documents and queries, respectively. From the output of the SPLADE model, every token weight $w_j(q)$ of a query is replaced with $S(w_j(q), b_q)$, which is $\mathrm{ReLU}(w_j(q) - b_q)$, and every document token weight $w_j(d)$ is replaced with $\tilde{H}(w_j(d), b_d)$ before their dot product is computed during training, as shown in Figure 1(a). Sigmoid thresholding $\tilde{H}$ is defined as:

$$\tilde{H}(w, b) = w \cdot \sigma\bigl(k(w - b)\bigr) = \frac{w}{1 + e^{-k(w - b)}}.$$

Here $k$ is a hyperparameter that controls the slope steepness of the step approximation, which jumps from 0 to 1 as $w$ crosses threshold $b$.
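A minimal NumPy sketch of the three thresholding functions used in this scheme; the function names are ours, and $k = 25$ mirrors the setting used later in the evaluation.

```python
import numpy as np

def soft_threshold(w, b):
    # S(w, b) = ReLU(w - b): applied to query token weights.
    return np.maximum(w - b, 0.0)

def hard_threshold(w, b):
    # H(w, b) = w if w >= b else 0: applied to documents at indexing time.
    return np.where(w >= b, w, 0.0)

def sigmoid_threshold(w, b, k=25.0):
    # ~H(w, b) = w * sigmoid(k * (w - b)): smooth stand-in for H in training.
    return w / (1.0 + np.exp(-k * (w - b)))

w = np.array([0.0, 0.3, 0.49, 0.51, 1.2])
b = 0.5
print(soft_threshold(w, b))     # shifts surviving weights down by b
print(hard_threshold(w, b))     # keeps surviving weights unchanged
print(sigmoid_threshold(w, b))  # close to hard_threshold away from b
```

Note that hard thresholding preserves the magnitudes of surviving weights, while soft thresholding shifts the whole weight distribution down by $b$; this difference drives the design choices above.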
The indexing process uses hard thresholding to replace with 0 all document token weights that are below threshold $b_d$, as depicted in Figure 1(b). This post-processing introduces an index approximation error $E = |H(w_j(d), b_d) - \tilde{H}(w_j(d), b_d)|$. We derive its upper bound as follows. Notice that $w_j(d) \geq 0$, and that $1 + x \leq e^x$ for any $x \geq 0$. When $w_j(d) < b_d$, $H(w_j(d), b_d) = 0$ and

$$E = \tilde{H}(w_j(d), b_d) = \frac{w_j(d)}{1 + e^{k(b_d - w_j(d))}} \leq w_j(d)\, e^{-k(b_d - w_j(d))}.$$

When $w_j(d) \geq b_d$, we can derive that

$$E = w_j(d) - \tilde{H}(w_j(d), b_d) = \frac{w_j(d)}{1 + e^{k(w_j(d) - b_d)}} \leq w_j(d)\, e^{-k(w_j(d) - b_d)}.$$

In both of the above cases, the error upper bound is minimized when $k$ is large. This is consistent with the fact that error $E$ is monotonically decreasing in $k$. When $k$ is big, the error is negligible; when $|w_j(d) - b_d|$ is small, the error can become big for a small $k$ value. But as shown later, an excessively large $k$ value can cause a big parameter update during a training step, affecting joint training stability.

Let $f_d$ and $f_q$ be the non-zero token weight counts of document $d$ and query $q$, respectively. For our hybrid thresholding, as threshold $b_d$ or $b_q$ increases, more weights can quickly be zeroed out. The document token regularization $L_d$ is computed on the training documents in each batch based on FLOPS regularization, and the query token regularization $L_q$ is based on the L1 norm. Let $Q$ be a set of training queries with $N$ documents involved in a batch. Applied to thresholded weights,

$$L_d = \sum_{j \in V} \Bigl(\frac{1}{N} \sum_{d} w_j(d)\, 1_{w_j(d) \geq b_d}\Bigr)^2, \qquad L_q = \frac{1}{|Q|} \sum_{q \in Q} \sum_{j \in V} S(w_j(q), b_q).$$

Here $1_{w \geq b}$ is an indicator function that is 1 if $w \geq b$ and 0 otherwise. When $b_d$ and $b_q$ increase, $f_d$ and $f_q$ decrease. Thus, for a batch of training queries $Q$, the original SPLADE loss is extended as:

$$L = \Bigl(\frac{1}{|Q|} \sum_{q \in Q} L_{rank}\Bigr) + \lambda_d L_d + \lambda_q L_q + \lambda_b L_b.$$

The extra term added is $L_b = \log(1 + e^{-b_d}) + \log(1 + e^{-b_q})$. We retain the original $L_d$ and $L_q$ expressions because, as shown in our evaluation, they remain useful for sparsity control together with $L_b$.
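A sketch of how this extended objective could be assembled in PyTorch, under our reading of the definitions above. The margin-MSE ranking loss is passed in as a precomputed scalar, the FLOPS term here regularizes the sigmoid-thresholded weights directly rather than indicator-masked raw weights (a simplification), and the assignment of the 0.01/0.008 weights to $\lambda_d$/$\lambda_q$ is our assumption.

```python
import torch
import torch.nn.functional as F

def extended_loss(rank_loss, w_docs, w_queries, b_d, b_q,
                  lam_d=0.01, lam_q=0.008, lam_b=1.0, k=25.0):
    """Sketch of the HT training objective (our reading of the paper).

    w_docs:    (N, |V|) document token weights from SPLADE
    w_queries: (M, |V|) query token weights
    b_d, b_q:  learnable scalar thresholds
    """
    hd = w_docs * torch.sigmoid(k * (w_docs - b_d))  # sigmoid thresholding
    sq = torch.relu(w_queries - b_q)                 # soft thresholding
    L_d = (hd.mean(dim=0) ** 2).sum()                # FLOPS-style doc term
    L_q = sq.sum(dim=1).mean()                       # L1 query term
    # L_b = log(1 + e^{-b_d}) + log(1 + e^{-b_q}), via softplus.
    L_b = F.softplus(-b_d) + F.softplus(-b_q)
    return rank_loss + lam_d * L_d + lam_q * L_q + lam_b * L_b

# Hypothetical usage: thresholds receive gradients alongside token weights.
w_docs = torch.rand(8, 100)
w_queries = torch.rand(4, 100)
b_d = torch.tensor(0.5, requires_grad=True)
b_q = torch.tensor(0.4, requires_grad=True)
extended_loss(torch.tensor(0.0), w_docs, w_queries, b_d, b_q).backward()
```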

Threshold and token weight updating
We study the changes $\Delta b_d$, $\Delta b_q$, $\Delta w_j(d)$, and $\Delta w_j(q)$ after each training step with a mini-batch gradient descent update. The analysis uses the first-order Taylor polynomial approximation and follows from the fact that sigmoid thresholding $\tilde{H}$ and the soft thresholding function $S$ are applied independently to a query and a document in the loss function. Symbol $\eta$ is the learning rate, and $d \triangleright q$ means $d$ is a positive or negative document of query $q$. The resulting update expressions indicate the following:

• A significant number of terms in $\Delta w_j(d)$ and $\Delta b_d$ involve a linear coefficient $k$; this is verifiably true also for $\Delta b_q$ (see the numeric check after this list). Although a large $k$ value can minimize the index approximation error $|H(w_j(d), b_d) - \tilde{H}(w_j(d), b_d)|$, it can cause an aggressive change of token weights and thresholds at a training iteration, making training overshoot and miss the global optimum. Thus $k$ cannot be too large, and our evaluation studies this further.

• If the gradient of the loss with respect to $b_d$ is non-positive, then $\Delta b_d \geq 0$: the document threshold increases, decreasing $f_d$. Otherwise the document token threshold may decrease after a parameter update step during training, and the degree of decrease is reduced by the positive value $\frac{e^{-b_d}}{1 + e^{-b_d}}$ contributed by $L_b$. Based on the sign of the corresponding gradient for queries, we can draw a similar conclusion on $\Delta b_q$.
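The linear dependence on $k$ can be checked numerically: differentiating $\tilde{H}(w, b) = w\,\sigma(k(w - b))$ with respect to $b$ gives $-k\, w\, \sigma (1 - \sigma)$, which at $w = b$ equals $-k w / 4$. A small autograd check with illustrative values:

```python
import torch

def threshold_grad(w_val: float, b_val: float, k: float) -> float:
    """Gradient of ~H(w, b) = w * sigmoid(k * (w - b)) with respect to b."""
    b = torch.tensor(b_val, requires_grad=True)
    w = torch.tensor(w_val)
    (w * torch.sigmoid(k * (w - b))).backward()
    return b.grad.item()

# At w = b the gradient magnitude is k * w / 4, i.e., linear in k, so a
# large k forces aggressive threshold updates at each training step.
for k in (2.5, 25.0, 250.0):
    print(k, threshold_grad(0.5, 0.5, k))
```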

EVALUATION
Our evaluation uses MS MARCO passages [4] and the BEIR datasets [33]. MS MARCO has 8.8M passages, while BEIR has 13 different datasets of varying sizes up to 5.4M documents. As a common practice, we report relevance in terms of mean reciprocal rank (MRR@10) for the MS MARCO passage Dev query set with 6980 queries, and normalized discounted cumulative gain (nDCG@10) [13] for its DL'19 and DL'20 sets and for BEIR. For retrieval with a SPLADE inverted index, we report the mean response time (MRT) and the 99th percentile time ($P_{99}$) in milliseconds. Query encoding time is not included. For the SPLADE model, we warm it up following [7, 17] and train it with hybrid thresholding, setting the two sparsity regularization weights to 0.01 and 0.008. We use the PISA [26] search system to index documents and process queries using SIMD-BP128 compression [18] and MaxScore retrieval [24, 27]. Our evaluation runs as a single thread on a Linux CPU-only server with an Intel i5-8259U 2.3GHz and 32GB memory. Similar retrieval latency results are observed on a 2.3GHz AMD EPYC 7742 processor. The checkpoints and related code will be released at https://github.com/Qiaoyf96/HT.

Overall results with MS MARCO. Table 1 is a comparison with the baselines on the MS MARCO passage Dev set, DL'19, and DL'20. It lists the average number of non-zero token weights and the retrieval time at depths 10 and 1000. Row 3 is the original SPLADE trained by ourselves, with an MRR@10 higher than the 0.38 reported in [7, 17]. Rows 12 and 13 list the results of our hybrid thresholding, marked as HT$_1$ and HT$_3$, both with $k = 25$. With $\lambda_b = 1$, SPLADE/HT$_1$ converges to a point where the two learned thresholds are 0.4 and 0.5, and it is about 2x faster in retrieval. HT$_3$ with $\lambda_b = 3$ converges at thresholds 0.7 and 0.8, resulting in a 3.1x speedup over SPLADE while having a slightly lower MRR@10 of 0.3942. No statistically significant degradation in relevance is observed at the 95% confidence level for either HT$_1$ or HT$_3$. The inverted index size reduces from 6.4GB for the original SPLADE to 2.8GB and 2.2GB for HT$_1$ and HT$_3$, respectively. When applying two-level guided traversal 2GTI [30] with its fast configuration, Rows 14 and 15 show a further latency reduction to 6.9ms or 19.3ms.
We discuss the other baselines listed in this table. Row 4, named DT, uses the thresholding scheme from [28]. Its training does not converge with its loss function, and its retrieval is much slower. Rows 5 and 6 follow joint training of top-$n$ masking [36] with the top 305 tokens, as suggested in [36], and with the top 100 tokens. Rows 7, 8, and 9, marked with DCP, follow document centric pruning [2], which keeps a fraction of top tokens per document, here 50%, 40%, and 30% (a sketch follows below). We did not list term centric pruning [1, 3] because [2] shows DCP is slightly better in relevance under the same latency constraint. Rows 10 and 11, with "/Cut0.5" and "/Cut0.8", apply a hard threshold of 0.5 or 0.8 to the output of the original SPLADE without joint training. The index pruning options without learning (Rows 5 to 11) can either reduce latency to the same level as HT, but with a visibly lower relevance score, or reach relevance similar to HT, but with much higher latency. This illustrates the advantage of learned hybrid thresholding with joint training.
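For contrast with learned thresholds, a sketch of the kind of untrained document-centric pruning used as a baseline here: keep a fixed fraction of the highest-weight tokens per document. The function name and values are illustrative.

```python
import numpy as np

def dcp_prune(w_doc: np.ndarray, keep_frac: float = 0.5) -> np.ndarray:
    """Keep only the top keep_frac fraction of a document's non-zero weights."""
    nz = np.flatnonzero(w_doc)
    keep = max(1, int(len(nz) * keep_frac))
    top = nz[np.argsort(w_doc[nz])[-keep:]]  # highest-weight non-zero entries
    pruned = np.zeros_like(w_doc)
    pruned[top] = w_doc[top]
    return pruned

w = np.array([0.0, 0.9, 0.1, 0.4, 0.0, 0.7])
print(dcp_prune(w, 0.5))  # keeps the two largest weights, 0.9 and 0.7
```

Unlike HT, this keeps a uniform fraction per document and its cutoff is not informed by the ranking loss, which matches the relevance/latency tradeoff observed in Rows 5 to 11.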
Zero-shot performance with BEIR. Table 2 lists the zero-shot performance of HT with retrieval depth 1000, applying the SPLADE/HT model learned from MS MARCO to the BEIR datasets without any additional training. HT$_1$ has a similar nDCG@10 score to SPLADE without HT, while having a 2x MRT speedup on average. HT$_3$ is even faster with a 3.6x speedup, and its nDCG@10 drops somewhat, to 0.494.

Training behavior. Figure 2 depicts the average values of the document token weights, $b_d$, and $f_d$ on the left, and the average query token weights, $b_q$, and $f_q$ on the right, during MS MARCO training under HT$_1$. The x-axis is the training epoch number. It shows that $f_d$ and $f_q$ decrease while $b_d$ and $b_q$ increase as training makes progress, and SPLADE/HT$_1$ converges after about 20 epochs.

Design options. Table 3 lists performance under four thresholding combinations from Row 3 to Row 7. $S$[·] means the soft thresholding function $S$ is applied in both training and indexing, where the argument can be documents (D) or queries (Q). $\tilde{H}$[·] means sigmoid thresholding $\tilde{H}$ is applied in both training and indexing. $H$[·] means $\tilde{H}$ is applied in training and $H$ is applied in indexing. The last option applies no thresholding during training and indexing. When thresholding is not applied to queries, $H$[D] is 1.3x faster than $S$[D] at retrieval depths 10 and 1000, while their relevance scores are similar: shifting the document weight distribution by soft thresholding significantly affects retrieval time. Rows 6 and 7 fix the $H$[D] setting and show that soft thresholding is more effective in relevance than hard thresholding for query tokens. Shifting the query weight distribution has less effect on latency, while gaining more relevance through model consistency between training and indexing.

Hyperparameter $k$ in sigmoid thresholding $\tilde{H}$. Table 3 compares $H$[D]$S$[Q] with $\tilde{H}$[D]$S$[Q] when varying $k$ from Row 8 to Row 14. In these cases, training always uses $\tilde{H}$, while indexing uses $H$ or $\tilde{H}$. When $k$ is as small as 2.5, applying $\tilde{H}$ to both training and indexing yields good relevance, but retrieval is about 1.8x slower because many more non-zero weights are kept in the index. When $k$ becomes as large as 250, training does not converge to the global optimum due to large update sizes, resulting in an MRR score lower than that of $k = 25$ even with no index approximation. $k = 25$ gives a reasonable MRR, while $H$[D]$S$[Q] is up to 26% faster than $\tilde{H}$[D]$S$[Q].

Retaining $L_d$ and $L_q$. The last three rows of Table 3 show that the query length is higher when $L_q$ is removed from the loss function, and documents get longer when $L_d$ is removed as well. This means $L_d$ and $L_q$ are useful in sparsity control together with $L_b$.

CONCLUDING REMARKS
Our evaluation shows that learnable hybrid thresholding with index approximation can effectively increase the sparsity of inverted indices, yielding 2-3x faster retrieval with competitive or slightly degraded relevance (a 0.28%-0.6% MRR@10 drop). Its trainability allows relevance- and sparsity-guided threshold learning, and it can outperform index pruning without such an optimization. Our scheme retains a non-uniform number of non-zero token weights per vector, based on trainable weight and threshold differences, for flexibility in relevance optimization. Our analysis shows that hyperparameter $k$ in sigmoid thresholding needs to be chosen judiciously for a small index approximation error without hurting training stability.
If a small relevance tradeoff is allowed, more retrieval time reduction is possible by applying other, orthogonal efficiency optimization techniques [17, 19, 22, 24, 29, 30]. Applying hybrid thresholding HT$_3$ to a checkpoint of a recent efficiency-driven SPLADE model [17], with 0.3799 MRR@10 on the MS MARCO passage Dev set, decreases the response time from 36.6ms to 21.7ms (1.7x faster) at retrieval depth 1000 while reaching 0.3868 MRR@10. This latency can be further reduced to 14.2ms with the same MRR@10 (0.3868) when 2GTI [30] is applied to the above index.
A future study is to investigate the use of the proposed hybrid thresholding scheme for other learned sparse models [10, 20, 23].

Figure 1: Hybrid thresholding with an index approximation


Figure 2: Weight/threshold/sparsity changes during training

Table 1: Overall results on MS MARCO passages

Table 2: Zero-shot performance on BEIR datasets