CSurF: Sparse Lexical Retrieval through Contextualized Surface Forms

Lexical exact-match systems perform text retrieval efficiently with sparse matching signals and fast retrieval through inverted lists, but naturally suffer from the mismatch between lexical surface form and implicit term semantics. This paper proposes to directly bridge the surface form space and the term semantics space in lexical exact-match retrieval via contextualized surface forms (CSF). Each CSF pairs a lexical surface form with a context source, and is represented by a lexical form weight and a contextualized semantic vector representation. This framework is able to perform sparse lexicon-based retrieval by learning to represent each query and document as a "bag-of-CSFs", simultaneously addressing two key factors in sparse retrieval: vocabulary expansion of surface form and semantic representation of term meaning. At retrieval time, it efficiently matches CSFs through exact-match of learned surface forms, and effectively scores each CSF pair via contextual semantic representations, leading to joint improvement in both term match and term scoring. Multiple experiments show that this approach successfully resolves the main mismatch issues in lexical exact-match retrieval and outperforms state-of-the-art lexical exact-match systems, reaching accuracy comparable to lexical all-to-all soft-match systems while remaining an efficient exact-match-based system.


INTRODUCTION
Lexical exact match, or matching query and document based on overlapping terms, has been widely utilized in text retrieval [37].
The strict and imperfect premise to indifferently match all terms with identical lexical surface form (e.g. matching all documents containing the word "Christmas" or "present" to the query "Christmas present") leads to sparse text sequence representations and matching signals, which enables fast retrieval via inverted indexing. However, this premise naturally neglects any information about the term semantics beneath a surface form, leading to vocabulary mismatch (same meaning shared by multiple surface forms) and semantic mismatch (different meanings under the same surface form) and consequently suboptimal retrieval performance.
The era of neural IR and pretrained language models [8,42] has led to several directions of work to train end-to-end lexical matching retrievers. Within the term-weighting premise, SPLADE [11,12] and its numerous extension models perform implicit vocabulary expansion, projecting the original text sequence to a new set of surface forms with learned term weights, but do not further track or distinguish the individual semantics of the generated lexical forms. Other systems propose to augment lexical form matching with contextualized vector representations and use vector similarity scoring to distinguish semantic agreement between query and document terms. Gao et al. [15] proposed contextualized inverted lists (COIL) to reduce semantic mismatch within the lexical exact-match framework, but this approach does not change the vocabulary of the text sequence and thus suffers from vocabulary mismatch. ColBERT [21,39] systems further perform soft all-to-all match, completely removing the lexical form match restriction and matching all possible query-doc term pairs with vector representation scoring. This results in maximum model capacity but sacrifices the efficiency gain of the exact-match framework, leading to impractical indexing and storage cost for first-stage retrieval.
The advances and limitations of current systems inspire us to rethink the lexical text retrieval process. Lexical retrievers judge a query-document pair by matching query and document terms and scoring each term pair. In the matching stage, performing exact lexical match via surface form naturally ensures sparse signals, which is fundamental for retrieval efficiency. In the scoring stage, matching terms by semantic representations produces more precise term match scores, but this step itself does not determine the correctness and sparsity of term matching signals. Previous systems utilizing vector term representations still match terms through the lexical surface forms of the original text and thus suffer from vocabulary mismatch of exact match (COIL) or efficiency issues of all-to-all soft match (ColBERT). On the other hand, SPLADE-based systems demonstrate the potential to learn new "bag-of-surface-forms" to overcome the mismatch in query-document vocabulary, but such learned surface forms are disconnected from the original context. An ideal end-to-end system should combine the advantage of the two sides, to learn to jointly find term matches via surface form, and further score such matches via term semantic representation.
This paper proposes CSurF, which enables effective sparse retrieval through Contextualized Surface Forms. Each contextualized surface form consists of a lexical surface form with an importance weight and a context source with a contextualized representation. We use CSurF to name the retrieval model, and CSF to name the "contextualized surface form" concept, which is the basic unit of the sparse lexical match process. Figure 1a demonstrates the model structure of the CSurF framework, and examples of generated CSFs. The model represents each query and document using a "bag-of-CSFs" by first generating a set of lexical surface forms. It grounds all such surface forms to a contextual semantics source in the original text to assign each surface form a vector representation. At retrieval time, the CSurF model first sparsely matches CSFs by exact-match of lexical surface forms, and further scores each CSF pair by comparing contextual representation similarity, which assures both the efficiency and effectiveness of the retrieval process.
Figure 1b-1d further demonstrate the retrieval process of CSurF compared to lexical exact-match and all-to-all soft-match systems. CSurF is able to link the original query term "gift" with the document term "present" since both the query and document generate CSFs with surface form "gift" and "present". From the classic lexical exact-match perspective, CSurF simultaneously incorporates lexical form expansion and contextualized scoring to resolve vocabulary and semantic mismatch in an end-to-end system. From the lexical soft-match perspective, CSurF efficiently connects original query-doc terms through fast exact-match of projected surface forms, leading to improved efficiency without loss in model capacity. CSurF outperforms current lexical exact-match systems on multiple in-domain and zero-shot retrieval experiments, consistently reaching the performance level of state-of-the-art lexical all-to-all soft-match models, but in an efficient sparse retrieval framework. We further discuss the effectiveness-efficiency tradeoff of CSurF and its improvements over current lexical retrieval systems, including the benefits of context source grounding, vector term representation extension, and sparse lexical soft match.
The rest of the paper is structured as follows: Section 2 discusses related work on neural lexical retrieval systems. Section 3 introduces the CSurF framework. Sections 4 and 5 discuss experiments and analysis of model performance. Section 6 concludes.

RELATED WORK
Before the emergence of pretrained language models, classic text retrieval systems [37] rely on exact-match signals of query and document terms to provide relevance judgment. Such systems assign each term with an importance weight estimated from term frequency, and perform efficient retrieval via inverted indexing. Traditional approaches to resolve the vocabulary and semantic mismatch in lexical exact-match systems include rewriting and expansion of the query and document [1,3,4,43,47], as well as n-gram matching [26]. Later systems utilize pretrained word embeddings such as Word2vec [27] to calculate term similarity [17], and propose lexical soft match or all-to-all match [7,45], performing complete interaction between query and document terms regardless of lexical form, but such systems are thus much less efficient and often limited to re-ranking settings.
The introduction of BERT-scale pretrained language models [8,42] provides a natural backbone for text understanding, and has led to significant improvement in both re-ranking [30] and first-stage ranking. Pretrained LMs were first used to improve BM25-based retrieval via improved term-reweighting [6] or document expansion [31,32]. Later research proposed to directly learn query and document representations for retrieval. One thread of work involves building dense retrievers to directly encode the query and document into a single vector representation, and directly predict relevance based on vector similarity. Main research in dense retrieval focuses on better training strategies such as negative sampling [46,48], knowledge distillation [18,25] and training objective design [13,14,16,19], as well as retrieval efficiency [20,49].
This paper mainly focuses on end-to-end lexicon-based retrieval or learned sparse retrieval [2,24,28], which represents the query and document with a set of vocabulary terms and term representations, and predicts relevance from aggregating term-level shallow interactions. Some systems follow the exact-match precondition and aim to perform efficient retrieval with contextualized term importance weights [12,24] or semantic vector representations [15,50]. To resolve vocabulary mismatch, systems such as SPLADE [12] and its extensions [11] also introduce the concept of vocabulary expansion to remap the original bag-of-words to a new set of vocabulary terms. Such systems further benefit from the speed-up of the inverted list structure to store the scalar or vector term representations [15]. Another group of LM-augmented lexicon-based systems performs lexical soft match with all-to-all token interaction and vector scoring, with the most important being ColBERT [21,39]. Soft-match retrievers fall on the other side of the capacity-efficiency tradeoff with high retrieval performance but also more token interactions required. Recent advances accordingly focus on reducing its storage and computational costs with approaches including residual compression and centroid pruning [38,39]. In this paper, we aim to find a direct approach to both allow and control the soft interaction between original terms with different surface forms, achieving accurate retrieval with sparse matching signals.
Concurrent with our work, Li et al. [23] proposed CITADEL which utilizes token-level vocabulary expansion as "routing" to achieve controlled soft match of original text terms with contextualized vector representations. While this approach is similar to our CSurF approach, the expanded tokens are solely generated by term-wise top-k selection of expansion tokens, and rely on post-hoc pruning of generated expansion tokens to balance retrieval efficiency. This is contrary to CSurF which directly and dynamically learns a set of surface forms for the input text sequence. Qian et al. [35] proposed ALIGNER, which directly learns sparse alignment between query and document terms by predicting token salience, or whether a given token requires alignment with other tokens. This leads to direct pruning of the all-to-all soft match matrix of ColBERT, and thus improved retrieval efficiency.

CONTEXTUALIZED SURFACE FORMS
In this section, we first formally define the lexical match and retrieval process in Section 3.1. Sections 3.2 and 3.3 introduce the encoding and retrieval process of CSurF respectively. Section 3.4 discusses the connection and advances of CSurF compared to current systems. Section 3.5 discusses model implementation and training.

Preliminaries
Given a query $q = \{q_1, q_2, \dots, q_n\}$ and document $d = \{d_1, d_2, \dots, d_m\}$, a lexical retriever scores the $(q, d)$ pair by accumulating individual term match scores, commonly via the "max-sum" framework:
$$\mathcal{S}(q, d) = \sum_{q_i \in q} \max_{d_j \in d} \mathcal{M}(q_i, d_j) \cdot \mathcal{S}(q_i, d_j)$$
For query term $q_i$ and document term $d_j$, $\mathcal{S}(q_i, d_j)$ is a term scoring operation, such as term weight multiplication or vector similarity calculation. $\mathcal{M}(q_i, d_j)$ is the term matching criterion, indicating whether $q_i$ and $d_j$ are matched. The strictness of the selection mask $\mathcal{M}$ directly determines the capacity and efficiency of the lexical retriever. As shown in Figure 1b, in exact-match systems, $\mathcal{M}(q_i, d_j) = \mathbb{I}(q_i = d_j)$ and only terms with identical form are matched. This results in sparse matching signals but suffers from the mismatch in query and document vocabulary. On the other hand, for all-to-all soft-match systems such as ColBERT, $\mathcal{M}(q_i, d_j) \equiv 1$ and all possible term pairs are matched regardless of surface form. This leads to maximum model capacity but at the cost of extremely high computation and storage cost.
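The effect of the matching criterion on the max-sum score can be illustrated with a minimal Python sketch. It reads the max-sum framework as a sum over query terms of the maximum masked term score, which is an assumed reading; the function names and the toy scorer (1.0 for identical terms, 0.5 for the pair "gift"/"present") are hypothetical and only serve the illustration.

```python
from typing import Callable, Sequence


def max_sum_score(
    query: Sequence[str],
    doc: Sequence[str],
    match: Callable[[str, str], bool],
    score: Callable[[str, str], float],
) -> float:
    """Sum over query terms of the best masked score over document terms."""
    total = 0.0
    for q in query:
        # Unmatched pairs are masked out and contribute 0 to the max.
        masked = (score(q, d) if match(q, d) else 0.0 for d in doc)
        total += max(masked, default=0.0)
    return total


def exact_match(q: str, d: str) -> bool:
    return q == d          # mask is 1 only for identical surface forms


def all_to_all(q: str, d: str) -> bool:
    return True            # mask is always 1 (ColBERT-style soft match)


def toy_score(q: str, d: str) -> float:
    # Hypothetical scorer standing in for a learned term scoring function.
    if q == d:
        return 1.0
    return 0.5 if {q, d} == {"gift", "present"} else 0.0


query = ["christmas", "gift"]
doc = ["christmas", "present", "ideas"]
exact = max_sum_score(query, doc, exact_match, toy_score)
soft = max_sum_score(query, doc, all_to_all, toy_score)
```

With exact match the query term "gift" finds no partner, while the all-to-all mask lets it reach "present" at the cost of scoring every term pair.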

Contextualized Surface Form Generation
CSurF aims to jointly control the sparsity of matching signals $\mathcal{M}$ and the precision of the scoring function $\mathcal{S}$ through sparse retrieval of contextualized surface forms (CSF). For each input sequence (query or document), it generates a bag-of-CSFs by (1) generating candidate lexical surface forms for the sequence, and (2) pairing each generated surface form with a term in the original text as its context source. CSurF follows Formal et al. [11] and generates candidate lexical surface forms from the entire vocabulary space. It first encodes each input sequence (query or document) with a language model (LM) backbone and separately projects the LM output of each query and document term to two components: a dense representation denoting term semantics, and a sparse $|V|$-dim vector denoting expansion weights for each token in the vocabulary.
$$\mathbf{v}_{q_i} = \phi_v(\mathrm{LM}(q, i)), \qquad \mathbf{E}_{q_i} = \phi_e(\mathrm{LM}(q, i))$$
For query term $q_i$, $\mathrm{LM}(q, i)$ denotes its LM output representation. $\phi_v$ and $\phi_e$ denote the semantic and expansion projection layers, which we discuss in Section 3.5. $\mathbf{v}_{q_i} \in \mathbb{R}^{|v|}$ denotes the contextual semantics representation for $q_i$. $\mathbf{E}_{q_i} \in \mathbb{R}^{|V|}$ represents non-negative expansion weights from $q_i$ to each token in the vocabulary $V$. For instance, $\mathbf{E}_{q_i}[t]$ represents the expansion weight from original term $q_i$ to surface form $t \in V$, with higher weight value denoting higher likelihood of expansion. Query and document term representations are processed through the same projection layers.
Over the whole query or document sequence, for surface form $t$, CSurF performs max-pooling over all expansion weights $\mathbf{E}_{q_i}[t]$ to select its sequence importance weight. Additionally, CSurF grounds the surface form $t$ to the original text by tracking the projection source of the selected expansion weight.
$$w^q_t = \max_{q_i \in q} \mathbf{E}_{q_i}[t], \qquad s^q_t = \arg\max_{q_i \in q} \mathbf{E}_{q_i}[t]$$
Given a query $q$ and surface form $t$, $w^q_t$ and $s^q_t$ respectively denote the final expansion weight and its projection source, i.e. $\mathbf{E}_{s^q_t}[t] = w^q_t$.
The grounding step crucially links the lexical surface form space and the original text semantic space, enabling CSurF to pair each surface form with a context source and construct CSFs. We denote a contextualized surface form as $c = (t_c, s_c)$, where $t_c \in V$ is its lexical surface form with a scalar importance weight $w_{t_c}$, and $s_c \in q$ or $s_c \in d$ is its context source with semantic representation $\mathbf{v}_{s_c}$. This enables CSurF to combine the advantage of lexical form matching and semantic-based scoring, which we discuss in the following sections.
For each query and document, the expansion-based "bag-of-CSFs" $\mathcal{E}$ is the set of all CSFs with positive surface form weights.
$$\mathcal{E}^q = \{(t, s^q_t) \mid t \in V,\ w^q_t > 0\}$$
Additionally, we construct CSFs for all original query and document terms to preserve the lexical form information of the original text. We directly use the term's self-projection weight $\mathbf{E}_{q_i}[q_i]$ or $\mathbf{E}_{d_j}[d_j]$ as its lexical form weight.
$$\mathcal{O}^q = \{(q_i, q_i) \mid q_i \in q\}, \qquad \mathcal{C}^q = \mathcal{E}^q \cup \mathcal{O}^q$$
where $\mathcal{O}$ denotes the original text-based CSFs. The final bag-of-CSFs $\mathcal{C}$ for each query and document is the union of expansion-based and original text-based CSFs.
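The construction of a bag-of-CSFs from per-term expansion weights can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: expansion weights are given as plain dictionaries instead of LM outputs, a CSF is keyed by (surface form, source position), and the name `build_bag_of_csfs` is hypothetical.

```python
from typing import Dict, List, Tuple


def build_bag_of_csfs(
    terms: List[str],
    expansion: List[Dict[str, float]],
) -> Dict[Tuple[str, int], float]:
    """Return {(surface form, source position): lexical form weight}.

    expansion[i][t] is the non-negative expansion weight from terms[i]
    to surface form t (assumed to be produced by the encoder).
    """
    best: Dict[str, Tuple[float, int]] = {}
    for i, weights in enumerate(expansion):
        for form, w in weights.items():
            # Max-pool over the sequence and remember the source term
            # that produced the maximum (the grounding step).
            if w > 0 and (form not in best or w > best[form][0]):
                best[form] = (w, i)

    # Expansion-based CSFs: every surface form with a positive weight.
    bag = {(form, src): w for form, (w, src) in best.items()}

    # Original-text CSFs: each term grounded to itself, weighted by its
    # self-projection weight.
    for i, term in enumerate(terms):
        bag[(term, i)] = expansion[i].get(term, 0.0)
    return bag


terms = ["christmas", "gift"]
expansion = [
    {"christmas": 2.0, "xmas": 1.2, "gift": 0.3},
    {"gift": 1.5, "present": 1.1, "christmas": 0.2},
]
bag = build_bag_of_csfs(terms, expansion)
```

In this toy input the surface form "gift" is grounded to the second term, because that term assigns it the larger expansion weight.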

Indexing and Retrieval
CSurF scores a query-document pair by sparse retrieval through their bag-of-CSFs $\mathcal{C}^q$ and $\mathcal{C}^d$. Specifically, it strictly matches CSF pairs through lexical exact match of their surface forms, and scores a matched CSF pair through vector similarity. For CSFs $c_q \in \mathcal{C}^q$ and $c_d \in \mathcal{C}^d$,
$$\mathcal{M}(c_q, c_d) = \mathbb{I}(t_{c_q} = t_{c_d}), \qquad \mathcal{S}(c_q, c_d) = w_{t_{c_q}} \cdot w_{t_{c_d}} \cdot f(\mathbf{v}_{s_{c_q}}, \mathbf{v}_{s_{c_d}})$$
Here, $f(\cdot)$ denotes a similarity function such as dot-product or cosine similarity. CSurF jointly utilizes the efficiency of lexical form match and the accuracy of contextualized term scoring. The exact-match precondition $\mathcal{M}$ limits the density of actual CSF matches, which enables the model to efficiently index document CSFs in inverted lists via their lexical surface forms. The contextualized scoring component further complements lexical match by introducing term semantics.
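CSurF follows the common "max-sum" framework to aggregate individual CSF match scores. At retrieval time, for each original query term, it selects the maximum score over all document CSFs.
$$\mathcal{S}(q_i, d) = \max_{c_q \in \mathcal{C}^q_{q_i},\ c_d \in \mathcal{C}^d} \mathcal{M}(c_q, c_d) \cdot \mathcal{S}(c_q, c_d)$$
where $\mathcal{C}^q_{q_i}$ denotes the subset of CSFs with $q_i$ as source term. The final $(q, d)$ score is the sum of individual query term scores.
$$\mathcal{S}(q, d) = \sum_{q_i \in q} \mathcal{S}(q_i, d)$$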
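A brute-force Python sketch of this scoring procedure may help make the roles of matching and scoring concrete. It assumes that a matched CSF pair is scored as the product of the two lexical form weights and the cosine similarity of the two source vectors; that product form, the `CSF` dataclass and the function names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CSF:
    form: str              # lexical surface form
    source: int            # position of the context source in the original text
    weight: float          # lexical form weight
    vec: Sequence[float]   # contextual vector of the source term


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def csurf_score(query_csfs: List[CSF], doc_csfs: List[CSF], n_query_terms: int) -> float:
    """Max over matched CSF pairs per source query term, summed over query terms."""
    best = [0.0] * n_query_terms
    for cq in query_csfs:
        for cd in doc_csfs:
            if cq.form != cd.form:
                continue  # exact match on surface form is required
            s = cq.weight * cd.weight * cosine(cq.vec, cd.vec)  # assumed score form
            best[cq.source] = max(best[cq.source], s)
    return sum(best)


# Query "gift" and document "present", each expanded to the other's form.
query_csfs = [CSF("gift", 0, 1.0, [1.0, 0.0]), CSF("present", 0, 0.5, [1.0, 0.0])]
doc_csfs = [CSF("present", 0, 2.0, [1.0, 0.0]), CSF("gift", 0, 0.5, [1.0, 0.0])]
score = csurf_score(query_csfs, doc_csfs, n_query_terms=1)
```

Both CSF pairs ground to the same query term, so only the stronger match contributes to the final score.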

Connection to Current Systems
CSurF is a direct extension of current lexicon-based retrieval models. From the surface form exact-match perspective, CSurF maintains the exact-match precondition and supports inverted indexing and sparse retrieval of contextualized surface forms. It simultaneously addresses the vocabulary and semantic mismatch issues of lexical exact-match by performing surface form expansion and assigning a contextualized representation to each surface form. The lexical exact-match systems COIL-tok and SPLADE can both be viewed as special cases of CSurF with limitations. COIL-tok is equivalent to CSurF with only original text-based surface forms $\mathcal{O}^q$ and $\mathcal{O}^d$ and without expansion, thus suffering in model capacity. SPLADE can be viewed as CSurF matching surface forms with only term weights (|v| = 0) and with slightly different score accumulation methods. Specifically, the CSurF framework enables SPLADE-based systems to be expanded to utilize vector term representations, by explicitly grounding each expanded surface form to the original context. We analyze the performance of CSurF compared to baseline lexical retrievers in Sections 5.1 and 5.3.
From the contextualized semantic match perspective, CSurF performs multi-vector lexical retrieval over the vector representations of query and document terms in the original text, but further introduces the surface form space to efficiently bridge the terms. CSurF is the equivalent of ColBERT with a strong surface form match constraint. It maps each term in the original text sequences to a set of candidate surface forms, so that original terms can be matched via expansion surface form overlap. The model is trained to jointly learn the soft-match constraint $\mathcal{M}$ of whether terms should match, along with the vector scoring $\mathcal{S}$, to filter unnecessary soft matches. This dramatically reduces the computation and indexing overload of ColBERT systems and leads to significantly improved efficiency, which we discuss in Section 5.2.

Model Implementation and Training
Following previous work, CSurF initializes the LM base with coCondenser [14], which is trained on a retrieval objective. At the encoding stage, we follow the implementation of Formal et al. [11] to generate surface form expansion weights.
$$\phi_v(h) = W_v h + b_v, \qquad \phi_e(h) = \log\left(1 + \mathrm{ReLU}(W_e h + b_e)\right)$$
The projection layers $\phi_v$ and $\phi_e$ are used for both the query and the document. The semantic projection layer $(W_v, b_v)$ is trained from scratch, while the expansion projection layer $(W_e, b_e)$ can be initialized from the masked language modeling (MLM) layer of the LM base.
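CSurF is trained end-to-end on the ranking objective with a contrastive loss. Given a query $q$ and a set of $N$ documents $D = \{d^+, d^-_1, \dots, d^-_{N-1}\}$, the training loss is
$$\mathcal{L}_{rank} = -\log \frac{\exp(\mathcal{S}(q, d^+))}{\sum_{d \in D} \exp(\mathcal{S}(q, d))}$$
where $\mathcal{S}$ is the scoring function. CSurF also applies a FLOPS [33] regularization loss to control the sparsity of the expansion weight matrix $\mathbf{E}$, and the number of expanded surface forms. The final loss is the sum of the retrieval loss and the regularization losses of the query and document.
$$\mathcal{L} = \mathcal{L}_{rank} + \lambda_q \mathcal{L}^q_{reg} + \lambda_d \mathcal{L}^d_{reg}$$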
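The training objective described above can be sketched in plain Python. The contrastive term is the negative log-softmax of the positive document's score, and the FLOPS regularizer is written in its usual form (sum over vocabulary entries of the squared batch-mean weight). Applying the regularizer to per-sequence pooled weights, and the names `lambda_q` and `lambda_d` for the regularization weights, are assumptions of this sketch.

```python
import math
from typing import List


def contrastive_loss(scores: List[float], positive: int = 0) -> float:
    """Negative log-softmax of the positive document's score."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive]


def flops_penalty(weights: List[List[float]]) -> float:
    """Sum over vocabulary entries of the squared batch-mean weight.

    weights[i][j] is the weight of vocabulary entry j in sequence i.
    """
    n = len(weights)
    vocab = len(weights[0])
    return sum((sum(row[j] for row in weights) / n) ** 2 for j in range(vocab))


def total_loss(
    scores: List[float],
    query_weights: List[List[float]],
    doc_weights: List[List[float]],
    lambda_q: float,
    lambda_d: float,
) -> float:
    return (
        contrastive_loss(scores)
        + lambda_q * flops_penalty(query_weights)
        + lambda_d * flops_penalty(doc_weights)
    )
```

Because the penalty squares the mean activation of each vocabulary entry, it pushes frequently activated surface forms toward zero more strongly than rare ones.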
We first train CSurF on the MSMARCO dataset [29] with training data sampled from a BM25 ranking. Additionally, we boost model performance by incorporating hard negative mining and knowledge distillation [25], where we sample hard negative training triplets from CSurF itself, utilize a cross-encoder re-ranker teacher to generate $(q, d)$ scores, and train a new CSurF model on the sampled training data with an additional KL-divergence loss [36], which aims to minimize the divergence between the relevance distributions of the cross-encoder teacher and the trained CSurF:
$$\mathcal{L}_{KL} = \sum_{d \in D} s_T(q, d) \log \frac{s_T(q, d)}{s(q, d)}, \qquad s_T(q, d) = \frac{\exp(\mathcal{S}_T(q, d))}{\sum_{d' \in D} \exp(\mathcal{S}_T(q, d'))}, \qquad s(q, d) = \frac{\exp(\mathcal{S}(q, d))}{\sum_{d' \in D} \exp(\mathcal{S}(q, d'))}$$
where $\mathcal{S}_T$ is the score of the cross-encoder teacher. In the rest of the paper, we use CSurF$_{KD}$ to refer to CSurF models trained with hard negatives and distillation.
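As a rough illustration of the distillation term, the following Python sketch computes the KL divergence between the teacher's and the student's softmax-normalized scores over the same candidate documents. The direction KL(teacher || student) and the function names are assumptions of this sketch.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl_distillation_loss(teacher_scores: List[float], student_scores: List[float]) -> float:
    """KL(teacher || student) over relevance distributions of one query's candidates."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's relevance distribution and grows as the two rankings diverge.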

EXPERIMENTAL METHODOLOGY
Implementation. The retrieval framework of CSurF was built upon the implementation of COIL, using PyTorch [34] and Huggingface [44]. Aside from ablation studies discussed in Section 5.1, we set the semantic representation dimension |v| = 32 with cosine similarity scoring. At indexing time, we prune CSFs with lexical form weight <1e-8 and organize CSFs in inverted lists in matrix format. Given a query, CSurF (i) performs score calculation for each query CSF and all document CSFs with the same lexical form, (ii) scatters each CSF-CSF match score to the source query-document term match with max reduction, and (iii) performs max-sum score aggregation to calculate the final query-document score.
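The three retrieval steps can be sketched with a dictionary-based inverted index. This is a simplified illustration, not the matrix-format implementation described above: postings are Python lists, the similarity is a dot product, a matched pair is scored as weight times weight times similarity (an assumption), and all names are hypothetical.

```python
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_index(docs, min_weight=1e-8):
    """docs: {doc_id: [(form, source_pos, weight, vec), ...]} -> {form: postings}."""
    index = defaultdict(list)
    for doc_id, csfs in docs.items():
        for form, src, weight, vec in csfs:
            if weight >= min_weight:  # prune CSFs with negligible lexical form weight
                index[form].append((doc_id, src, weight, vec))
    return index


def retrieve(query_csfs, index):
    """query_csfs: [(form, source_pos, weight, vec), ...] -> {doc_id: score}."""
    # (i) score each query CSF against postings with the same surface form,
    # (ii) scatter to the (doc, query term, doc term) pair with max reduction.
    pair_best = {}
    for form, q_src, q_w, q_vec in query_csfs:
        for doc_id, d_src, d_w, d_vec in index.get(form, ()):
            score = q_w * d_w * dot(q_vec, d_vec)
            key = (doc_id, q_src, d_src)
            pair_best[key] = max(pair_best.get(key, 0.0), score)
    # (iii) max over document terms per query term, then sum over query terms.
    term_best = defaultdict(dict)
    for (doc_id, q_src, _), score in pair_best.items():
        per_term = term_best[doc_id]
        per_term[q_src] = max(per_term.get(q_src, 0.0), score)
    return {doc_id: sum(per_term.values()) for doc_id, per_term in term_best.items()}


docs = {
    "d1": [("gift", 0, 0.5, (1.0, 0.0)), ("present", 0, 2.0, (1.0, 0.0)),
           ("the", 1, 1e-9, (0.0, 1.0))],
    "d2": [("tree", 0, 1.0, (0.0, 1.0))],
}
index = build_index(docs)
results = retrieve([("gift", 0, 1.0, (1.0, 0.0)), ("present", 0, 0.5, (1.0, 0.0))], index)
```

Only documents that share at least one surface form with the query are ever touched, which is where the efficiency of the exact-match precondition comes from.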
Training. We first train CSurF on BM25 negatives on a single GPU for 6 epochs, with 6 queries per batch, 8 documents per training sample (1 positive and 7 negative), and $\lambda_q = \lambda_d$ = 1e-4. We train CSurF$_{KD}$ variants on 4 GPUs for 8 epochs, with 6 queries per batch and 12 documents per training sample. For simplicity, all CSurF$_{KD}$ variants are trained with the same set of hard negatives sampled from the top 1000 of CSurF rankings on MSMARCO training queries. For all experiments we utilize in-batch negatives for better training. We discuss the effect of tuning $\lambda_q$ and $\lambda_d$ in Section 5.2.
Evaluation. We train and evaluate CSurF's in-domain retrieval performance on the MSMARCO passage dataset and on two sets of queries, the MSMARCO Dev query set, and the TREC DL 2019 [5] test queries. Following previous work, we report MRR@10 and Recall@1000 on MSMARCO dev, and NDCG@10 and Recall@1000 for TREC 19 queries. We also perform out-of-domain retrieval experiments on the BEIR benchmark [40], which include multiple datasets with drastically different retrieval settings, domains, and document content. We report model performance on 13 BEIR datasets, with NDCG@10 as the official metric.
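For reference, two of the reported metrics can be computed per query as in the following sketch; the function names are hypothetical, and the corpus-level figures are the averages of these per-query values over all queries.

```python
from typing import List, Set


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 1000) -> float:
    """Fraction of relevant documents that appear within the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)
```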
Baselines. We mainly compare the performance of CSurF to lexical matching retrieval systems, including: (1) BM25 [37] and BM25 with DocT5Query augmentation [31], (2) lexical exact-match systems COIL-tok [15] and SPLADE [11,12], and (3) lexical all-to-all soft-match system ColBERT [21,39]. Note that in this work we do not perform extended model and training setup design as discussed in recent works to improve learned sparse retrieval training [22,28], and compare CSurF's performance to the original results reported for SPLADE++ and ColBERT-v2, which use the same or comparable training setups. We also include the performance of two dense retrieval systems [14,19] and the hybrid COIL system which performs hybrid retrieval with dense and lexical components. We separately evaluate and compare systems trained without and with knowledge distillation and hard negative mining.

EXPERIMENTAL RESULTS
In this section, we report CSurF's in-domain retrieval performance on MSMARCO in Sections 5.1 and 5.2, respectively focusing on retrieval effectiveness and efficiency. Section 5.3 discusses out-of-domain retrieval performance. Section 5.4 concludes the experiments with a detailed case study of generated CSFs.

In-domain passage retrieval effectiveness
Table 1 reports CSurF's retrieval performance on MSMARCO. Without knowledge distillation and under comparable LM settings, CSurF outperforms current lexical exact match systems, including SPLADE and COIL-tok, on both recall and accuracy (MRR@10). With hard-negative mining and distillation, CSurF still outperforms SPLADE++ on MRR@10. This demonstrates CSurF's ability to bridge the vocabulary and semantic mismatch in lexical exact match. Furthermore, the sparse retrieval framework CSurF consistently reaches the performance level of the all-to-all soft-match ColBERT in all training settings, with much lower retrieval complexity. We analyze the retrieval cost and effectiveness-efficiency tradeoff for CSurF in Section 5.2. We further perform two sets of ablation studies on model components of CSurF, to investigate the connection and comparison between CSurF and existing lexical exact-match systems. Namely, we look into the effect of two aspects discussed in Section 3.4: lexical form expansion and contextual semantics grounding.
Table 2 reports the retrieval performance on MSMARCO for CSurF variants. To look into the effect of lexical expansion, we train CSurF models which only generate CSFs from original-text-based lexical forms $\mathcal{O}$ or only from expansion-based lexical forms $\mathcal{E}$. Compared to solely performing exact-match over the original text, introducing lexical form expansion improves model performance by a wide margin. Specifically, it is the primary source of gain in recall (0.955 to 0.975 without distillation). Without the expansion component, the model capacity of CSurF is limited, even with extra knowledge distillation from the cross-encoder teacher. This suggests that lexical exact-match signals of original text terms are naturally not sufficient to accurately predict query-document relevance, and it is critical to introduce extra matching signals.
To look into the effect of contextualized representations and term scoring, we experiment with four model variants, which include: (i) three vector representation settings where |v| = 32 with the term scoring function $f$ as dot product and cosine similarity, and |v| = 4 with cosine scoring, and (ii) a term weighting-based model variant where |v| = 0 (i.e. only the lexical expansion weight $w$ is used). We see from Table 2 that all systems perform similarly on recall, with multi-vector systems outperforming term-weighting systems in accuracy (MRR@10). This result together with the shown result in Table 1 demonstrates the benefit of expanding sparse retrieval systems such as SPLADE to multi-vector systems. We further note that the lexical weight-only CSurF variant still outperforms the current SPLADE++ model in accuracy, and is slightly better than the highest reported results in recent SPLADE-related works [22], which performs in-depth analysis of applying different model training techniques, such as regularization and separate query encoder, to improve SPLADE performance. This demonstrates the effectiveness of the additional grounding step from lexical form to context source in CSurF, which allows the match performance of original terms to directly guide the learning of expansion terms at training time. This also shows that CSurF's performance could be potentially further improved with the application of training techniques discussed above.

Retrieval Efficiency
This section discusses the effectiveness-efficiency trade-off in the retrieval procedure of CSurF. As a sparse retrieval system, CSurF represents a query or document with a set of CSFs. Without changing the dimensionality of the semantic representations, the indexing storage cost of CSurF is directly determined by the number of CSFs generated per document, and the retrieval computational cost is further determined by the number of CSF matches given a $(q, d)$ pair. Therefore, we report the following metrics to compare the run cost of CSurF: (i) the average number of terms to represent the query and document, i.e. $|\mathcal{C}^q|$ and $|\mathcal{C}^d|$, and (ii) the average number of "term matches", or term-scoring operations required given a $(q, d)$ pair [12]. For exact-match systems such as CSurF, this is estimated by
$$\sum_{t \in V} n^q_t \cdot n^d_t$$
where $n^q_t$ and $n^d_t$ denote the average number of occurrences for token $t$ in a query or document.
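The term-match estimate can be computed from sampled bags of surface forms as in the sketch below. It assumes the estimate is the sum over surface forms of the product of their average per-query and per-document occurrence counts; the function names and the representation of a bag as a list of surface forms are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def avg_occurrences(bags: List[List[str]]) -> Dict[str, float]:
    """Average number of occurrences of each surface form per bag."""
    counts: Counter = Counter()
    for bag in bags:
        counts.update(bag)
    return {form: c / len(bags) for form, c in counts.items()}


def expected_term_matches(query_bags: List[List[str]], doc_bags: List[List[str]]) -> float:
    """Estimated number of term-scoring operations per (query, document) pair."""
    n_q = avg_occurrences(query_bags)
    n_d = avg_occurrences(doc_bags)
    return sum(n_q[form] * n_d.get(form, 0.0) for form in n_q)


query_bags = [["gift", "present"], ["gift"]]
doc_bags = [["gift", "gift", "tree"], ["present"]]
ops = expected_term_matches(query_bags, doc_bags)
```

Surface forms that are frequent on both sides dominate this estimate, which is why removing high-frequency forms reduces the retrieval cost disproportionately.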
At training time, the number of generated CSFs is mainly affected by the regularization weights $\lambda_q$ and $\lambda_d$ which control the FLOPS penalty and thus the sparsity of the expansion weights. We list the performance of CSurF with different $\lambda_q$ and $\lambda_d$ settings in Table 3. CSurF is able to maintain high model capacity with a relatively small bag-of-CSFs size, achieving >0.39 MRR@10 with (3.5×, 1.5×) or (1.5×, 2.7×) the lengths of the original query and document. More importantly, CSurF learns sparse matching signals without degradation in model capacity. Compared to current multi-vector retrievers, CSurF significantly outperforms the exact-match retriever COIL-tok, while requiring comparable and often fewer retrieval time matching operations. On the other hand, CSurF reaches comparable model effectiveness to ColBERT-v2, with a significantly lower calculation operation count than all-to-all soft match of all tokens (>400 ops.), and not requiring further filtering stages of candidate passage selection as introduced in ColBERT-v2.
We also experiment with post-hoc CSF inverted index pruning, where we explicitly prune $\rho \in [0, 1)$ of the encoded corpus, removing document CSFs with the lowest lexical form expansion weights. Figure 2 reports the performance change of two CSurF models trained with $\lambda_q = \lambda_d$ = 1e-2 and $\lambda_q = \lambda_d$ = 1e-3 with cosine similarity scoring, with different pruning threshold $\rho$. We see that pruning further reduces the number of redundant CSFs and match signals for both models, leading to further retrieval efficiency improvement while maintaining retrieval capacity (MRR@10>0.39, R@1000>0.98 with 50% of terms pruned). These experiments demonstrate that the CSurF retrieval process can be highly efficient due to the sparsity of CSF match signals.
To look into the properties of the CSF generation and matching processes, we analyze and plot the distribution of corpus CSFs' lexical form frequency and expansion weight in Figure 3. Compared to the lexical form frequency distribution of the original text, CSurF is trained to simultaneously expand meaningful lexical surface forms and prune existing lexical terms with low term importance, and it removes a significant proportion of tokens with the highest occurrence frequency, most of which do not carry important contextual meaning, such as stop words. This leads to the aforementioned comparison where CSurF requires fewer scoring operations per query than COIL-tok despite having "longer" queries or documents. Post-hoc index pruning with $\rho$ = 0.5 further removes redundant matching signals of lexical forms at all frequencies, resulting in a significant decrease of matching operations without major influence on retrieval performance. For instance, after training and post-hoc pruning, the occurrences of the five most frequent terms in the MSMARCO passage set ("the", "of", "and", "in", "to") are reduced by over 98.5% compared to their original corpus term frequency.
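Post-hoc pruning of this kind can be sketched as a global cut over all postings in the index. The sketch assumes that the pruned fraction is taken over the whole encoded corpus rather than per document or per inverted list; the name `prune_index` and the posting layout (surface form, document id, weight) are hypothetical.

```python
from typing import List, Tuple

Posting = Tuple[str, str, float]  # (surface form, doc id, lexical form weight)


def prune_index(postings: List[Posting], ratio: float) -> List[Posting]:
    """Drop the `ratio` fraction of postings with the lowest lexical form weights."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    n_keep = len(postings) - int(len(postings) * ratio)
    # Keep the highest-weight postings; ties are broken by original order.
    return sorted(postings, key=lambda p: p[2], reverse=True)[:n_keep]


postings = [("a", "d1", 0.1), ("b", "d1", 0.9), ("c", "d2", 0.5), ("d", "d2", 0.3)]
kept = prune_index(postings, 0.5)
```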
Finally, we note that CSurF utilizes vector term representations, and the storage and run cost of CSurF is affected by the representation dimension |v| and the computational cost of the actual term-scoring operation $f(\cdot)$. As listed in Table 2, we recognize the tradeoff in performance where a higher representation dimension leads to improved performance. Many recent works have also targeted improving the efficiency of representation storage and scoring in multi-vector retrieval, with proposed methods such as semantic representation clustering and residual compression [10,38,39] also compatible with CSurF. We leave detailed engineering and optimization of the vector scoring step as future work.

Out-of-domain retrieval
In this section, we evaluate CSurF on out-of-domain retrieval. Table 4 lists the retrieval performance of CSurF on 13 datasets in the BEIR benchmark. Compared to baseline approaches, CSurF achieves best performance on 6 of 13 datasets, with a win-loss-tie of 10:3:0 and 8:4:1 compared to ColBERT-v2 and SPLADE++ respectively. We also observed very different trends in performance across different datasets. Specifically, CSurF has low performance on Climate-Fever [9] and Fever [41]. This may be related to the property of the retrieval tasks and queries focusing more on exact match of specific entities, and vector representations introducing noise.
Based on previous work that discusses the effect of original lexical terms in zero-shot retrieval settings, we utilize CSurF's capability to track the lexical form source (original text or expansion) and perform an extra experiment that explicitly introduces and emphasizes the original-text lexical form information. We experiment with a weight-based interpolation approach for CSurF scoring: at retrieval time, we modify the lexical form weights of CSFs and apply a penalty to all CSFs generated solely via expansion, i.e. $w^{*} = (1 - \alpha)\, w + \alpha\, \mathbb{I}(s \in O)\, w$, where $w$ is the lexical form weight of a CSF with lexical form $s$, $\mathbb{I}(s \in O)$ indicates whether $s$ appears in the original text $O$, and $\alpha \in [0, 1]$ is the penalty parameter. $\alpha = 0$ represents the original CSurF performance without interpolation, while a higher $\alpha$ indicates lower confidence in, or a larger penalty on, expansion-based CSFs.
We test penalty parameter values $\alpha \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$ and list the oracle performance of CSurF after interpolation and the corresponding $\alpha$ in Table 4. On most datasets, the oracle performance of CSurF is achieved at $\alpha$ between 0.1 and 0.3. This sends a mixed message: CSurF is overall effective in expanding meaningful surface forms even in zero-shot settings, but the LM backbone and expansion component may still suffer from the change of retrieval domain, and interpolation with an original-text-only retrieval source, or explicitly emphasizing original text importance, can still be helpful.
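The interpolation and the oracle sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions: the penalty parameter is called `alpha` here purely for illustration, each CSF is assumed to carry a flag saying whether its lexical form appears in the original text, and `evaluate` is a hypothetical callback that returns a retrieval metric for a given penalty value.

```python
def interpolate_weight(weight, from_original, alpha):
    """Penalize the weight of a CSF that was generated solely via expansion.

    weight: lexical form weight of the CSF.
    from_original: True if the lexical form appears in the original text.
    alpha: penalty parameter in [0, 1]; 0 leaves every weight unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    indicator = 1.0 if from_original else 0.0
    return (1.0 - alpha) * weight + alpha * indicator * weight


def oracle_alpha(evaluate, alphas=None):
    """Return the (alpha, score) pair with the best score over a grid.

    evaluate: hypothetical callable mapping alpha -> metric value.
    """
    if alphas is None:
        alphas = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
    scores = {a: evaluate(a) for a in alphas}
    best = max(alphas, key=lambda a: scores[a])
    return best, scores[best]


# Original-text CSFs keep their weight; expansion-only CSFs are scaled down.
kept = interpolate_weight(2.0, True, 0.3)
penalized = interpolate_weight(2.0, False, 0.3)

# Toy metric that peaks at alpha = 0.2, standing in for a real evaluation run.
best_alpha, best_score = oracle_alpha(lambda a: -(a - 0.2) ** 2)
```

With a penalty of 1.0, expansion-only CSFs receive zero weight, so retrieval falls back to the lexical forms of the original text.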

Case study
In this section, we present 3 detailed examples of the generated bag-of-CSFs in Table 5. We select a query and a document from the MSMARCO passage dataset, and a query from the FEVER dataset. We observe that CSurF is able to understand the context and expand surface forms that are related to the original term (e.g. plural forms or acronyms), that express the same concept ("duration" and "length" expanded from "how long"), or that are terms in related fields ("nfl" and "final" expanded from "super bowl", which are all related to concepts in American football). It also assigns higher weights to important CSFs such as core entities, and assigns lower weights to, or directly prunes, terms like "how", "is" and "the", matching the analysis in Section 5.2.
We also discuss two interesting observations that point to directions for further improvement of CSurF. For proper nouns such as person names, dates or locations, CSurF may not understand the entity name (e.g. "Death Note" as a TV series), or may over-generate lexical terms that refer to the same type of entity but are irrelevant to the current context (e.g. generating names from the original term "Jefferson"). These are classic problems in exact-match-based systems, and potential solutions include injecting external knowledge to correctly distinguish and generate related entities, and rethinking and refining the post-hoc pruning stage for CSF selection. We also observe that CSurF occasionally generates relevant lexical surface forms from a seemingly "incorrect" source such as a stop word. This does not affect surface form matching, but may affect the accuracy of vector term representation and scoring. In the FEVER example, CSurF learns to prune the original terms "a" and "in", but still generates the surface forms "genre" and "date" from them. This calls for a deeper analysis of the lexical expansion step, but also raises a potential model extension, where we introduce extra sources that generate lexical forms related to the overall concept or topic of the text sequence, with an independent representation.
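To make the distinction between surface form matching and vector scoring concrete, the sketch below represents a bag-of-CSFs as a list of small records and scores a query against a document. The scoring rule is an assumption made for illustration, not the paper's exact formulation: a matched pair is scored as the product of the two lexical form weights and the cosine similarity of the two vectors, the best-scoring document CSF is kept for each query CSF, and the results are summed. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CSF:
    form: str      # lexical surface form, used for exact match
    source: int    # position of the source term in the original text
    weight: float  # lexical form weight
    vector: tuple  # contextualized vector of the source term


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(query_csfs, doc_csfs):
    """Exact-match on surface form, then score matched pairs by vector.

    Assumed rule: weight product times cosine, max over document CSFs
    sharing the form, summed over query CSFs.
    """
    by_form = {}
    for d in doc_csfs:
        by_form.setdefault(d.form, []).append(d)

    total = 0.0
    for q in query_csfs:
        matches = by_form.get(q.form, [])
        if matches:
            total += max(
                q.weight * d.weight * cosine(q.vector, d.vector)
                for d in matches
            )
    return total


query = [CSF("nfl", 0, 1.0, (1.0, 0.0))]
doc = [
    CSF("nfl", 3, 2.0, (1.0, 0.0)),    # same form, aligned source vector
    CSF("nfl", 5, 1.0, (0.0, 1.0)),    # same form, unrelated source vector
    CSF("final", 3, 1.0, (1.0, 0.0)),  # different form, never matched
]
example_score = score(query, doc)
```

Both document CSFs with the form "nfl" are matched regardless of their source, but only the one whose source vector agrees with the query contributes to the score. This mirrors the observation above: an "incorrect" source leaves matching intact and only changes the vector score.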

CONCLUSION
This paper proposes CSurF, which performs sparse lexicon-based retrieval by constructing and matching Contextualized Surface Forms. Its retrieval process combines efficient surface form exact match and fine-grained contextualized semantic scoring, which maximizes model capacity while maintaining the simplicity and efficiency of exact-match-based retrieval systems.
CSurF extends current term-weight-based learned sparse retrieval approaches with vector term representations. In experiments across multiple datasets and retrieval settings, CSurF is able to simultaneously bridge the vocabulary and semantic mismatch in exact-match retrieval, and achieves state-of-the-art retrieval performance for lexical exact-match systems. Ablation studies and analysis further demonstrate CSurF's ability to jointly expand meaningful surface forms and ground surface forms to underlying semantics, which leads to increased model capacity. We also propose a simple interpolation approach for out-of-domain retrieval settings, to analyze the effect of original text vs. expanded surface forms as well as the quality of lexical form expansion on different retrieval tasks.
Compared to all-to-all soft-match retrievers, CSurF achieves comparable performance across all retrieval tasks while remaining an exact-match-based retrieval system. CSurF is able to learn sparse connections between the original query and document terms, resolving the key efficiency issue of lexical soft match. The retrieval efficiency of CSurF can also be further optimized with different approaches, including training regularization adjustment, post-hoc index pruning, and vector representation approximation or dimension control, without significantly affecting retrieval accuracy. We hope this work encourages more research on building effective, efficient, robust and knowledge-enhanced sparse retrieval systems in the real world, as well as on exploring the connections and distinctions among current retrieval frameworks and systems.
(a) Model structure and encoding process of CSurF. (b) Matching process of exact-match retrievers. (c) Matching process of all-to-all soft-match retrievers. (d) Matching process of CSurF.

Figure 1: The model structure of the CSurF retriever (left), and the retrieval process of CSurF compared to current lexicon-based exact-match or soft-match retrievers (right). CSurF generates Contextualized Surface Forms (CSFs) by pairing lexical forms with semantic representations of original text terms. CSFs are further indexed in inverted lists and matched based on lexical form.

Figure 2: Performance of CSurF with different pruning thresholds.

Figure 3: Lexical form frequency and weights for CSurF. The red dotted curve denotes the weight distribution. The model is trained with both regularization weights set to 1e-2. The x-axis denotes the percentage of corpus CSFs.

Table 1: Retrieval results on MSMARCO. Best performance under each training setting is labeled in bold.

Table 2: Ablation study for CSurF and comparison to current lexical exact-match systems. All models are initialized with co-Condenser. CSurF models are trained with both regularization weights set to 1e-3. The "Full" model capacity indicates using CSFs from both original text and expansion (O + E), and vector scoring with |v| = 32.

Table 3: Model efficiency analysis. Numbers labeled with * indicate the original MARCO dev query/document lengths after BERT tokenization. Other length statistics are calculated after pruning inverted index entries with 0 term weight.

Table 5: Examples of bag-of-CSFs. We show a query and a document from MSMARCO and a query from FEVER, in this order. Original terms are marked in bold. The lexical forms and lexical weights of generated CSFs are listed after their source term. CSFs removed by post-hoc pruning (with a threshold of 0.5) are presented in gray, and remaining CSFs are underlined.