Adapting Learned Sparse Retrieval for Long Documents

Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary. While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents. We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is crucial for LSR to handle long documents. To leverage this property, we propose two adaptations of the Sequential Dependence Model (SDM) to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term dependence, while SoftSDM uses potential functions that model the dependence of query terms and their expansion terms (i.e., terms identified using a transformer's masked language modeling head). Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches for different document length constraints. Surprisingly, SoftSDM does not provide any performance benefits over ExactSDM. This suggests that soft proximity matching is not necessary for modeling term dependence in LSR. Overall, this study provides insights into handling long documents with LSR and proposes adaptations that improve its performance.


INTRODUCTION
Learned sparse retrieval (LSR) is a recent family of first-stage retrieval methods that use neural models, typically transformers, to encode queries and documents into sparse weight vectors aligned with a vocabulary. By operating in the lexical space, LSR provides several benefits over dense retrieval, including greater transparency, while remaining comparably effective [4,5,15]. Moreover, LSR's compatibility with an inverted index allows for the reuse of techniques previously developed for lexical retrieval (e.g., BM25) [8]. Previous evaluations of LSR methods have focused on short texts due to transformers' input length limit, either by using the MSMARCO passage collection [16] or by truncating texts to the transformer's maximum input length.
To handle long documents, a common practice is to split the input into multiple segments, encode these segments individually, and aggregate the output. Several works have investigated score and representation aggregation strategies for the Cross-Encoder architecture, finding that representation aggregation typically outperforms score aggregation (e.g., taking the max of segments' vector representations rather than taking the max segment score) [2,7,11,21].
We hypothesize that these insights may change for LSR due to the inherent difference in the encoding mechanism between Cross-Encoders and LSR. While Cross-Encoders encode the query and document simultaneously, LSR encodes queries and documents separately, which makes it challenging to judge what from a document segment is not relevant to the information need. As a result, the aggregation operation may accumulate noise and make documents less separable. In addition, as the documents get longer, it is particularly difficult for LSR, which relies solely on term-based representations, to deal with scattered term matches. Intuitively, matches within neighbouring text should be a stronger relevance signal than isolated matches present in different segments [19].
In this work, we first study the effectiveness of existing aggregation approaches for adapting LSR to long documents. Specifically, we investigate three aggregation operators (max, sum, mean) on two output levels (representation, score). By applying these approaches to the state-of-the-art LSR method [5,15] on two datasets (MSMARCO Document, TREC Robust04), we find that most of the approaches, except for the max score aggregation, are fragile with long documents; as can be seen in Figure 1, their performance drops severely as more document segments are added into the representations.
Max score aggregation, however, avoids this downward trend with mostly stable performance over different numbers of segments. This suggests that the proximity matching that inherently happens with max score aggregation is crucial for LSR to handle long documents.
In order to better utilize local proximity, we propose two approaches, ExactSDM and SoftSDM, that adapt the SDM [14] for use with LSR. ExactSDM assumes a phrase or proximity match if the exact phrase or constituent terms appear in the document within a local document window. SoftSDM relaxes this assumption by allowing constituent terms to be softly matched with their expansion terms, as the Splade method does for unigrams.
We evaluate the performance of both ExactSDM and SoftSDM on the MSMARCO Document and TREC Robust04 datasets, finding that both approaches consistently outperform previous aggregation methods in the context of LSR. However, SoftSDM does not provide significant advantages over ExactSDM, making ExactSDM a more attractive option as it can be applied to a wider range of LSR methods, including both MLM-based architectures that leverage a masked language modeling (MLM) head to identify and score expansion terms [4,5,10] and MLP-based architectures that estimate term salience using a multilayer perceptron (MLP) with no soft matching via expansion terms [3,9,13].
Overall, our work sheds light on adapting LSR to long documents, with an approach adapted from SDM that significantly outperforms existing aggregation techniques for neural retrieval methods.

BACKGROUND

Learned sparse retrieval
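Learned Sparse Retrieval (LSR) methods score a query-document pair by taking the dot product between their sparse vectors produced by a query encoder ($f_Q$) and a document encoder ($f_D$), i.e.,

$$s(q, d) = f_Q(q) \cdot f_D(d) = \sum_{i=1}^{|V|} w_q^i \, w_d^i$$

where $w_q^i$ (respectively $w_d^i$) represents the predicted weight of the $i$-th term in the vocabulary for the query (respectively the document).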
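The dot-product scoring described above can be illustrated with a minimal Python sketch. It assumes each sparse vector is stored as a dictionary from term to weight; the function name `lsr_score` and the toy weights are hypothetical and only serve the illustration.

```python
def lsr_score(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller vector; terms missing from the other one contribute zero.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


# Hypothetical weights, chosen only for illustration.
query = {"sparse": 1.5, "retrieval": 2.0}
doc = {"sparse": 0.5, "retrieval": 1.0, "index": 0.25}

score = lsr_score(query, doc)  # 1.5 * 0.5 + 2.0 * 1.0 = 2.75
```

Only terms with non-zero weight on both sides contribute, which is what makes this scoring compatible with an inverted index.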
Compared to dense retrieval, LSR's vectors are high dimensional and sparse, with most of their elements being zero. The dimensions of these vectors are tied to a vocabulary, so LSR is closely connected to traditional sparse retrieval methods such as BM25. However, unlike BM25, LSR learns the term weights instead of obtaining them from corpus statistics like TF-IDF. This similarity with BM25 makes LSR compatible with many existing techniques such as the inverted index, which were previously built for purely lexical retrieval models like BM25 [8].
Splade is a state-of-the-art LSR method that uses a masked language modeling (MLM) head based on BERT to perform term expansion and term weighting end-to-end. Given an input query/document, Splade encodes it into a logit matrix $W$, where $W_{i,j}$ represents the translation score from the $i$-th term in the input sequence to the $j$-th vocabulary item ($|V| \approx 30$k). Splade then outputs, for each term in the vocabulary, a non-negative log-scaled weight, which is the maximum log-scaled logit of the term across the sequence, i.e., $w_j = \max_i \log(1 + \mathrm{ReLU}(W_{i,j}))$. This max aggregation retains only the term weights and, like all LSR methods, drops positional information. In other words, Splade uses term positional information when estimating query and document term weights, but is position-agnostic when scoring documents. Positional information is critical for inferring phrase semantics, however; "MU defeated Arsenal" has the opposite meaning of "Arsenal defeated MU". Furthermore, IR axioms suggest that documents that contain phrases from the query (or query terms in close proximity) are more relevant than those with query terms scattered throughout [19].
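To make the max aggregation concrete, the following sketch pools a toy logit matrix into one weight per vocabulary item. It is an illustration only: the matrix is a plain list of rows (one row per input token), the function name `splade_pool` is hypothetical, and the values are made up.

```python
import math


def splade_pool(logits):
    """Collapse a (sequence length x vocabulary size) logit matrix
    into one non-negative weight per vocabulary item."""
    vocab_size = len(logits[0])
    return [
        # log(1 + ReLU(x)) for every position, then the maximum over positions.
        max(math.log1p(max(row[j], 0.0)) for row in logits)
        for j in range(vocab_size)
    ]


# Toy matrix: 3 input tokens, vocabulary of 4 items (values are made up).
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.0, 3.0, -2.0, 0.0],
    [1.0, 0.0, -0.5, 0.0],
]
weights = splade_pool(logits)
```

The output has one entry per vocabulary item and no record of which position produced the maximum, which is the loss of positional information discussed above.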

Term Dependence Model
In prior work, researchers introduced techniques that incorporate the dependence of two or more terms into traditional bag-of-words retrieval models [1,6,14]. Among these techniques, the Sequential Dependence Model [14] (SDM) is one of the most widely known and has been shown to be effective.
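SDM assumes a dependence between neighboring query terms and models this dependence with a Markov Random Field (MRF). The MRF in SDM defines a joint probability over a graph $G$ whose nodes are a document random variable ($D$) and the query terms $q_1, q_2, \dots, q_{|Q|}$. The edges between nodes represent the dependency between them, where a node is independent of all other nodes given its neighbors. The query-document conditional relevance probability is defined via this joint probability, factorized as follows:

$$P_\Lambda(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c, \Lambda)$$

Here, $C(G)$ is the set of cliques in the graph, $\psi(c, \Lambda)$ is a potential function parameterized by $\Lambda$, and $Z_\Lambda$ is a normalization factor. SDM realizes the above formula by defining three potential functions:
• Exact individual term matching: $\psi_T(q_i, D) = \lambda_T \log P(q_i \mid D)$
• Exact n-gram/phrase matching: $\psi_O(q_i, \dots, q_{i+k}, D) = \lambda_O \log P(\#1(q_i, \dots, q_{i+k}) \mid D)$, where $\#1(\cdot)$ matches the terms as an exact ordered phrase
• Exact proximity matching: $\psi_U(q_i, \dots, q_{i+k}, D) = \lambda_U \log P(\#uwN(q_i, \dots, q_{i+k}) \mid D)$, where $\#uwN(\cdot)$ matches the terms in any order within a window of $N$ tokens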
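The difference between exact phrase matching and exact proximity matching can be sketched with two simple counting functions. This is a simplified illustration, not the counting scheme of any particular retrieval toolkit: the function names are hypothetical, the proximity count simply slides a fixed-size window over the document, and the smoothed probability estimates used by SDM are left out.

```python
def count_exact_phrase(doc_tokens, phrase):
    """Number of positions where `phrase` occurs contiguously and in order."""
    n = len(phrase)
    return sum(
        1
        for k in range(len(doc_tokens) - n + 1)
        if doc_tokens[k:k + n] == phrase
    )


def count_unordered_window(doc_tokens, terms, window):
    """Number of sliding windows of `window` tokens that contain
    every term in `terms`, in any order."""
    needed = set(terms)
    num_windows = max(len(doc_tokens) - window + 1, 1)
    return sum(
        1
        for k in range(num_windows)
        if needed <= set(doc_tokens[k:k + window])
    )


doc = "arsenal defeated mu in the final while mu defeated chelsea".split()

ordered = count_exact_phrase(doc, ["mu", "defeated"])            # 1
unordered = count_unordered_window(doc, ["mu", "defeated"], 2)   # 2
```

The ordered count sees only "mu defeated", while the unordered window also accepts "defeated mu", which is why the two signals are kept as separate potential functions.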

SEQUENTIAL DEPENDENCE FOR LSR
In this section, we introduce SoftSDM and ExactSDM, which are adapted from the original SDM, for performing phrase and proximity matching with LSR. We formulate these two adaptations on top of the state-of-the-art Splade LSR encoder that utilizes BERT's masked language modeling head. While SoftSDM's formulation is tied to the MLM head, ExactSDM can be generalized to other LSR architectures that produce term weights differently.

Query and document representations
In order to measure the dependence between terms, it is necessary to track both term positions and weights. We utilize a modified version of Splade [15] that enhances efficiency without compromising its effectiveness. It produces query term weights using an MLP layer (instead of MLM) on top of BERT's last hidden states, generating a sequence of term weights $(w_{q_1}, w_{q_2}, \dots, w_{q_{|Q|}})$, with $w_{q_i}$ representing the weight of the $i$-th query token.
On the document side, the model's MLM document encoder produces for each document a logit matrix $W^D$, where $W^D_{i,j}$ represents the translation score from the $i$-th document term to the $j$-th vocabulary item. This logit matrix provides access to term positions and preceding term-weight vectors. We sparsify $W^D$ during training, enabling efficient storage via a sparse matrix format.

Soft Sequential Dependence Model
To adapt SDM for use with the MLM head, we introduce three new potential functions derived from the query and document representations described previously. These potential functions are defined as follows (with $v[q_i]$ used to denote the index of $q_i$ in the vocabulary):
• Soft individual term matching: This potential function is equivalent to the max-aggregation followed by a dot product used in LSR methods (e.g., Splade).
• Soft n-gram/phrase matching: Soft phrase similarity measures the likelihood of terms in one phrase translating to corresponding terms in another, considering the importance of each term. This function computes the maximum similarity between a query phrase and document phrases starting at every position $k$.
• Soft proximity matching: This function approximates the maximum likelihood of translating terms within document windows of size $p$ to a set of query terms regardless of the order, while also considering term importance.
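The three soft potential functions can be sketched in Python as follows. This is one plausible reading of the verbal descriptions above, not the exact formulation: the log-saturated activation, the use of a weighted sum over phrase terms, the omission of the weighting factors, and all function names are assumptions made for illustration. A query term is a pair (vocabulary index, weight), and the document logit matrix is stored sparsely as one dictionary per position.

```python
import math


def act(logit):
    # Splade-style activation log(1 + ReLU(x)); applying it here is an assumption.
    return math.log1p(max(logit, 0.0))


def soft_term(q, W):
    """Query weight times the best activation of the term's vocabulary index."""
    idx, weight = q
    return weight * max((act(row.get(idx, 0.0)) for row in W), default=0.0)


def soft_phrase(phrase, W):
    """Best ordered alignment of the query phrase over all start positions k."""
    n = len(phrase)
    return max(
        (
            sum(w * act(W[k + j].get(idx, 0.0)) for j, (idx, w) in enumerate(phrase))
            for k in range(len(W) - n + 1)
        ),
        default=0.0,
    )


def soft_proximity(terms, W, window):
    """Best window of `window` positions, matching query terms in any order."""
    num_windows = max(len(W) - window + 1, 1)
    return max(
        sum(
            w * max((act(row.get(idx, 0.0)) for row in W[k:k + window]), default=0.0)
            for idx, w in terms
        )
        for k in range(num_windows)
    )


# Toy document with 4 positions; each row maps vocabulary index -> logit (made up).
W_D = [{0: 2.0, 1: 0.5}, {1: 3.0}, {2: 1.0}, {0: 1.0}]
query = [(0, 1.0), (1, 2.0)]  # (vocabulary index, query term weight)

phrase_score = soft_phrase(query, W_D)
proximity_score = soft_proximity(query, W_D, window=2)
```

Under these assumptions, reversing the query terms lowers the phrase score but leaves the proximity score unchanged, which mirrors the ordered versus unordered distinction in the original SDM.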
Exact Sequential Dependence Model. ExactSDM uses the same potential function for individual term matching as SoftSDM (Equation 4), which has been shown to benefit LSR [4,5]. However, with ExactSDM, we are interested in evaluating the impact of soft phrase/proximity matching. ExactSDM accomplishes this by allowing a document term to only translate to itself in the output logits and by disabling document expansion. This change modifies the potential functions in Equations 5 and 6 by restricting the logit matrix to self-translations. ExactSDM is compatible with an MLM or MLP document encoder, making it applicable to additional LSR methods like uniCOIL [9], DeepCT [3], and DeepImpact [13].
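Restricting the logit matrix to self-translations can be sketched as a simple filter. The representation (one dictionary of vocabulary index to logit per document position) and the function name `keep_self_translations` are assumptions for illustration, not the authors' implementation.

```python
def keep_self_translations(token_ids, W):
    """For each document position, drop every logit except the one that
    maps the token at that position to its own vocabulary index."""
    return [
        {tid: row[tid]} if tid in row else {}
        for tid, row in zip(token_ids, W)
    ]


# Toy document: vocabulary indices of the 4 tokens and their sparse logit rows.
token_ids = [0, 1, 2, 0]
W_D = [{0: 2.0, 1: 0.5}, {1: 3.0, 2: 0.7}, {2: 1.0}, {0: 1.0, 3: 0.2}]

W_exact = keep_self_translations(token_ids, W_D)
# Position 0 keeps only index 0; the expansion entry for index 1 is removed.
```

After this filter, each position carries at most one weight, which is also what an MLP-based encoder without expansion would provide.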

EXPERIMENTS

Datasets
Our experiments use two long-document benchmarks: MSMARCO Document and TREC Robust04. The MSMARCO Document dataset [16] includes 3.2 million documents, 5.2K dev queries, and 367K queries in the train split. TREC Robust04 [20] consists of approximately 0.5 million news articles and 250 queries, with each query containing three fields: title, description, and narrative. In this work, we use only the description field, which contains a natural language version of the query. We access these datasets using ir_datasets [12].

Experimental settings
We utilize the Splade-based LSR architecture using DistilBERT [18] with a maximum input length of 512 tokens for all experiments. This architecture was trained on the MSMARCO passage dataset using hard negatives and distillation from a Cross-Encoder [17]. To handle long documents, we divided them into sentences and grouped sentences into segments of up to 400 tokens, with longer sentences considered as a single segment. Document representations or scores were derived from the segments' representations/scores generated by the Splade model, which achieved a Passage MRR@10 of 38.51 when previously trained on the MSMARCO passage dataset [16] (with no fine-tuning on the document dataset).
We explore various aggregation approaches, including Rep-max, Score-max, Rep(score)-sum, and Rep(score)-mean, that describe how segment-level term scores are treated to arrive at a document-level score. The representation aggregation methods (e.g., Rep-max) operate on the sparse vector corresponding to each segment, whereas the score aggregation methods (e.g., Score-max) operate on relevance scores assigned to each segment. With Rep-max, a document-level sparse vector is computed by taking the maximum weight for each vocabulary term across all document segments. This document-level vector is then used to compute the relevance score by taking the dot product with the query vector. Score-max calculates a relevance score between each segment and the query, and then returns the maximum segment relevance score as the document-level relevance score. Rep-sum and Rep-mean replace Rep-max's max pooling with sum pooling or mean pooling across segments. Due to the distributive property of dot products, sum and mean pooling return identical results regardless of whether they operate on sparse vectors (representations) or scores. Thus, we refer to these as Rep(score)-sum and Rep(score)-mean.
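The aggregation approaches described above can be compared on a toy example. The sketch below assumes segment vectors stored as dictionaries from term to weight; the function names and the weights are hypothetical.

```python
def dot(q, d):
    """Sparse dot product between two {term: weight} dicts."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def rep_pool(segments, op):
    """Pool per-term weights across segments; a term absent from a segment counts as 0."""
    terms = set().union(*segments)
    return {t: op([seg.get(t, 0.0) for seg in segments]) for t in terms}


def mean(xs):
    return sum(xs) / len(xs)


def rep_max(q, segments):
    return dot(q, rep_pool(segments, max))


def score_max(q, segments):
    return max(dot(q, seg) for seg in segments)


def rep_sum(q, segments):
    return dot(q, rep_pool(segments, sum))


def score_sum(q, segments):
    return sum(dot(q, seg) for seg in segments)


def rep_mean(q, segments):
    return dot(q, rep_pool(segments, mean))


def score_mean(q, segments):
    return mean([dot(q, seg) for seg in segments])


# Hypothetical query and three document segments.
q = {"a": 1.0, "b": 2.0}
segments = [{"a": 1.0, "c": 0.5}, {"b": 1.5}, {"a": 0.5, "b": 0.5}]

print(rep_max(q, segments), score_max(q, segments))  # 4.0 3.0
print(rep_sum(q, segments), score_sum(q, segments))  # 5.5 5.5
```

In this example Rep-max scores higher than Score-max because it combines the best match for "a" from the first segment with the best match for "b" from the second, whereas Score-max only rewards terms that co-occur within a single segment.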
SoftSDM and ExactSDM were evaluated using the same query and document segment representations as the above approaches, but with additional positional information. Score-max can be viewed as a special case of SoftSDM with soft proximity matching within the segment windows. When using these approaches, we fine-tune the three weighting factors of the potential functions using the MSMARCO Document training triplets obtained from Boytsov et al. [2]. We also assess the generalizability of these weighting factors on TREC Robust04 without any further fine-tuning.
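As a rough sketch of where the three weighting factors enter, the block below writes a document score as a weighted sum of the summed potentials, mirroring the rank-equivalent form of the original SDM [14]. The symbols $\lambda_T$, $\lambda_O$, $\lambda_U$ and the un-weighted potentials $\tilde{\psi}$ are notation assumed here for illustration; the exact parameterization used for ExactSDM and SoftSDM may differ.

```latex
\mathrm{score}(Q, D) \;=\;
  \lambda_T \sum_{i} \tilde{\psi}_T(q_i, D)
  \;+\; \lambda_O \sum_{i} \tilde{\psi}_O(q_i, \dots, q_{i+n-1}, D)
  \;+\; \lambda_U \sum_{i} \tilde{\psi}_U(q_i, \dots, q_{i+n-1}, D)
```

Under this reading, only three scalars are tuned on the training triplets, which is consistent with reusing them on TREC Robust04 without further fine-tuning.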

Results and Discussion
In our experiments, we compare existing aggregation methods with our proposed SDM for LSR variants as the number of segments in documents increases. We first look at the performance of existing long-document aggregation methods evaluated with LSR. Figure 1 illustrates the results of segment aggregation techniques with respect to the number of segments, with a detailed breakdown in Table 1. We observe a strong downward trend with existing approaches as more document segments are considered, with the exception of Score-max on both the MSMARCO Document and Robust04 datasets.
The approach of summing segment scores or representations, Rep(score)-sum, suffers from the most severe decrease in performance on the MRR@10 and NDCG@10 measures. The MRR@10/NDCG@10 on MSMARCO Document/Robust04 drops significantly from 36.63/44.88 when using the first segment to worse than BM25 SDM/RM3 when 10 segments are used. The gap becomes even larger (not shown in the table) when all segments are used, while BM25's scores only slightly decrease. This issue can be attributed to the longer and noisier texts resulting from using more segments, which leads to the addition of more non-relevant terms to the document's representation. Moreover, sum aggregation is inherently biased towards longer documents; less relevant documents may receive a higher accumulated score than shorter but more relevant ones.
The Rep(score)-mean aggregation method corrects for this bias by dividing the sum by the number of segments, thus exhibiting a clear recovery on both datasets. We also note the competitiveness between Rep(score)-mean and Rep-max. On MSMARCO Document, Rep-max consistently outperforms the mean aggregation, but the opposite is true for Robust04. In contrast with prior results on Cross-Encoders [7], where representation aggregation methods are preferable, the Score-max approach outperforms all other aggregation methods and is more robust than the max representation pooling. The MRR@10 of Score-max slightly drops on MSMARCO Document, while its NDCG@10 on Robust04 goes up initially, up to the third segment, and becomes stable after that. The local scoring capability of Score-max likely enables it to avoid noise accumulation. This discrepancy between performance with LSR and with Cross-Encoders may be due to the fact that Cross-Encoders can effectively ignore non-relevant segments, because they know both the query and the document at the time of encoding. LSR methods must encode documents without any knowledge of the query.
The robustness of Score-max while the other methods trend downward suggests that proximity is crucial for effectively scoring long documents using LSR. This observation leads to the ExactSDM and SoftSDM models, both of which are adapted from the well-known SDM from the pre-neural era. As shown in the two leftmost blocks of Table 1, both ExactSDM and SoftSDM consistently outperform previous aggregation approaches and BM25 RM3/SDM on both datasets, regardless of varying document lengths. Overall, ExactSDM and SoftSDM perform competitively on the two datasets.
On MSMARCO Document, the MRR@10 of both SoftSDM and ExactSDM slightly increases on the first two or three segments, but starts to fluctuate after that. In comparison with the previous best pooling method, Score-max, both SDM variants achieve better MRR@10, with the improvement ranging from 1.0% to 2.0%. On Robust04 (zero-shot), the improvement over Score-max is even greater, ranging from 3.9% to 5.8%. Additionally, the improvement slightly increases when more segments are used, demonstrating the ability of SDM variants to exploit long context.
Comparing SoftSDM and ExactSDM, we see that ExactSDM slightly outperforms SoftSDM, making it a strong choice for modeling term dependence with LSR. ExactSDM does not necessarily rely on the MLM's logits, which means it is compatible with other LSR methods that do not use MLM encoders.
Regarding the optimal phrase and proximity window size, we found that using two-term phrases (bi-grams) and a proximity window of size 8 often yields better results on MSMARCO Document, which is consistent with the recommendations in [1,14]. While this setting generalizes well to TREC Robust04 with ExactSDM, SoftSDM needs longer phrases (5-grams) and a longer proximity window (size 10) for better zero-shot performance.

CONCLUSION
Our work explores techniques for adapting learned sparse retrieval to long documents. We find that the max-score aggregation approach is robust to varying document lengths, leading to our SoftSDM and ExactSDM adaptations of SDM that outperform existing approaches. Surprisingly, SoftSDM's soft-matching ability does not outperform ExactSDM, indicating that it may not be necessary for modeling term dependence.

Figure 1: Performance of aggregation methods proposed for neural rankers with LSR and long documents.

Table 1: The results of baselines and SDM variants on MSMARCO Document and TREC Robust04. MRR and NDCG are @10. Recall is @1000. A † indicates $p < 0.05$ (paired t-test between an SDM method and Score-max with Bonferroni correction).