L2R: Lifelong Learning for First-stage Retrieval with Backward-Compatible Representations

First-stage retrieval is a critical task that aims to retrieve relevant document candidates from a large-scale collection. While existing retrieval models have achieved impressive performance, they are mostly studied on static datasets, ignoring that in the real world, the data on the Web is continuously growing with potential distribution drift. Consequently, retrievers trained on static old data may not suit newly arriving data well and inevitably produce sub-optimal results. In this work, we study lifelong learning for first-stage retrieval, focusing especially on the setting where the emerging documents are unlabeled, since relevance annotation is expensive and may not keep up with data emergence. Under this setting, we aim to develop model updating with two goals: (1) to effectively adapt to the evolving distribution with the unlabeled newly arriving data, and (2) to avoid re-inferring the embeddings of all old documents each time the model is updated, so that the index can be updated efficiently. We first formalize the task and then propose a novel Lifelong Learning method for first-stage Retrieval, namely L2R. L2R adopts the typical memory mechanism for lifelong learning and incorporates two crucial components: (1) selecting diverse support negatives for model training and memory updating, for effective model adaptation, and (2) a ranking alignment objective that ensures the backward compatibility of representations, saving the cost of index rebuilding without hurting model performance. For evaluation, we construct two new benchmarks from the LoTTE and Multi-CPR datasets to simulate the document distribution drift in realistic retrieval scenarios. Extensive experiments show that L2R significantly outperforms competitive lifelong learning baselines.


INTRODUCTION
First-stage retrieval aims to quickly retrieve a few relevant document candidates from a large-scale collection, and has become a core component in information retrieval (IR) applications [12,47]. While retrieval models based on pre-trained language models (PLMs) [21,36,45,48] have demonstrated impressive performance, most of them are studied on static datasets, neglecting that in the real world, new documents are continuously emerging on the Web. For example, when a new event (e.g., ChatGPT) breaks out, a large number of documents on this topic are generated and shared, accompanied by booming information needs regarding the topic (searching for not only new documents but also old ones). The emerging documents and queries on new topics may cause the distribution of the retrieval collection to drift over time. Consequently, directly applying a model trained on previous data to the new collection is obviously not an optimal solution. Then, how can we continuously learn a retrieval model to adapt to the evolving data distribution effectively and efficiently? To study this problem, we formalize the task of lifelong learning for first-stage retrieval.
Lifelong learning [6,41] has been widely studied in the machine learning community, especially on computer vision (CV) tasks [25,43]. In a typical setting of lifelong learning, a model is set to learn with non-identically and independently distributed (non-I.I.D.) new-coming data [43], with the goal of preserving acquired knowledge and learning new knowledge. Most research [1,18,35] in this field focuses on addressing the catastrophic forgetting issue [11,26], i.e., the model's inability to perform well on previously seen data after being updated with new data. One representative paradigm for lifelong learning is the memory-based method [1,5], which stores and replays historical samples while training on new data to mitigate the forgetting of acquired knowledge. These lifelong learning methods have been shown to be effective in various CV tasks [30,46]. However, there has been limited research on the lifelong learning problem for IR tasks.
In this paper, we study the task of lifelong learning for first-stage retrieval in a setting where new documents are unlabeled. We focus on this setting in our initial attempt because relevance annotation on the new data is expensive and may not catch up with data emergence. Besides the essential goal of general lifelong learning, i.e., preserving acquired knowledge and learning new knowledge, this setting poses several new challenges: 1) Without labeled positive samples, the new data provides limited supervision for model learning. Moreover, the unlabeled positives in the new data could mislead the model if we simply treat all new documents as irrelevant during training.
2) It incurs significant costs to re-compute all document representations and rebuild the entire index each time the model is updated. Ideally, repeated representation computation should be avoided without harming model performance.
3) The pairwise modeling of query-document pairs in IR makes the task more complicated than the pointwise modeling of classification tasks in CV. For any query, either seen or unseen, the model needs to achieve good retrieval performance on both new and old documents. Due to these challenges, existing lifelong learning methods from other fields cannot be directly applied to the retrieval task. Although some work has explored the catastrophic forgetting issue of re-ranking models under the lifelong learning setting, no feasible solutions have been proposed to solve it [10,23].
To address the above challenges, we propose a memory-based Lifelong Learning method for first-stage Retrieval, named L2R. L2R maintains a buffer to store the historical support negatives (i.e., negative samples that are important for learning the decision boundary of the model) for each training query, and when a session of unlabeled new documents arrives, it updates the model as follows: 1) To adapt the model to the new distribution, L2R selects diverse support negatives from the unlabeled new data for model training, by estimating their confidence of being hard negatives and their redundancy with other selected ones. 2) To balance the model's ability to preserve acquired knowledge and learn new knowledge, L2R selects historical support documents distinct from the selected new samples and uses them together for model updating. 3) To avoid re-inferring embeddings of old documents each time the model is updated, L2R incorporates a novel ranking alignment objective that ensures the backward compatibility of document representations without harming retrieval performance. Overall, through the selection strategy of diverse support negatives and the ranking alignment objective for compatible learning, L2R enables effective and efficient lifelong learning of retrieval models.
For evaluation, we construct two benchmarks based on the LoTTE [36] and Multi-CPR [22] datasets, namely LL-LoTTE and LL-MultiCPR, to simulate the realistic retrieval scenario where documents emerge continuously with distribution drift. The empirical results on both benchmarks show that L2R outperforms representative and state-of-the-art lifelong learning baselines in terms of metrics for both learning new data and addressing the forgetting issue. Moreover, our proposed ranking alignment objective achieves not only representation backward compatibility but, remarkably, even better performance. We further confirm the advantages of our model through in-depth studies on the data selection strategy and the backward-compatible alignment objectives.

RELATED WORK
Lifelong Learning. Lifelong learning [6], also referred to as continual learning [41] or incremental learning [4], has received much attention in building adaptive systems that are able to gain, retain, and transfer knowledge when facing non-stationary data streams. Research in this field mainly focuses on solving the catastrophic forgetting issue [25,43]. There are three main method paradigms, including regularization-based [16,18], architecture-based [7,35], and memory-based methods [1,5,17,32].
Lifelong learning has been widely studied in various machine learning tasks [30,32,39,46]. Recently, Mai et al. [25] surveyed a wide range of methods to address the lifelong learning problem for image classification. In natural language processing, research on lifelong learning mostly focuses on pre-training [3,44]. For example, Qin et al. [29] proposed ELLE for efficient incremental pre-training on emerging data. Wu et al. [44] compared the performance over combinations of five PLMs and four lifelong learning approaches. However, to the best of our knowledge, there have been no studies on lifelong learning for first-stage retrieval.

First-stage Retrieval. In recent years, substantial efforts have been made on various retrieval models [12,47], including both classical term-based methods like BM25 [34] and more recent PLM-based dense retrieval models [9,19,20,24]. PLM-based retrieval models have compelling performance and are widely adopted in industry. However, most existing studies are on static document sets, ignoring the realistic scenario wherein new documents continually arrive at the system.
Lifelong learning for information retrieval (IR) is an important but under-explored topic, covering both the first-stage retrieval and the re-ranking stage. Recently, Lovón-Melgarejo et al. [23] and Gerald and Soulier [10] studied the lifelong learning problem for PLM-based re-ranking models and observed the catastrophic forgetting issue in lifelong IR model learning. Later, Mehta et al. [27] studied continual learning for generative IR models [40], focusing on how to incrementally index new documents into the model parameters rather than on the distribution shift caused by newly emerged data. Lifelong learning has also been studied in image retrieval [37,42]. However, those experiments were conducted on fine-grained image classification datasets, whose settings completely differ from the realistic scenario of document retrieval.
Compatible Representation Learning. Learning compatible representations [8,13,31,37] is a practical need in many scenarios, with the goal of ensuring that the embeddings generated by different models are compatible. For example, BCT [37] and LCE [28] learn compatible representations for image recognition, where the embeddings computed by the updated model are directly comparable to those generated by previous models. Specifically, BCT [37] constrains the feature space by simultaneously enabling gradient flow from both the old and new classifiers. However, this method is not suitable for first-stage retrieval, since the relevance score is calculated directly on the embeddings and there are no classification layers. LCE [28] bridges multiple feature spaces via a lightweight transformation function; however, it still needs to re-compute all embeddings of the previous images, which is inefficient for the large collections in retrieval. Beyond these, representation compatibility has received increasing attention in asymmetric retrieval [8], where the query and document sides use different models due to the constrained resources of the computing platform. In contrast to these methods, we study compatible representation learning under the lifelong learning setting for first-stage retrieval.

TASK DESCRIPTION
First-stage Retrieval. Given a query q and a document collection D_0, first-stage retrieval aims to find potentially relevant documents. With a labeled training dataset C_0 = {(q, d_q^+)}, where q is a query and d_q^+ ∈ D_q^+ is one of the relevant documents for q, we can build an initial retrieval model M_0 using a dual-encoder architecture with a standard contrastive learning objective [12,47]. Then, the embeddings of the documents in D_0 are extracted and indexed, and retrieval is performed by estimating the similarity between the query embedding and the document embeddings in the index.

Lifelong Learning for First-stage Retrieval. A stream of document sets {D_1, ..., D_T} with different distributions arrives in T sessions sequentially. Note that these new documents have no relevance labels. For any session t ∈ {1, ..., T}, the lifelong learning algorithm A utilizes the documents in D_t and the external memory B_{t-1} to update M_{t-1} to M_t, aiming to adapt the retriever to the new distribution, where B_{t-1} and B_t are the external memories for sessions t-1 and t respectively, which store useful information for lifelong model learning, e.g., a subset of training samples or historical versions of the model. If the model updating is representation backward-compatible, at session t we only need to compute document embeddings for D_t using model M_t; the embeddings of D_{0:t-1} = ∪_{i=0}^{t-1} D_i that were computed with historical models can be reused when updating the index with existing techniques [14]. Otherwise, we need to use M_t to compute the embeddings of all documents in D_{0:t} to rebuild the retrieval index.
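As a toy illustration of why backward compatibility saves index-maintenance cost, the sketch below reuses the stored session-0 embeddings and appends only the new session's embeddings before retrieving by dot product. The arrays and the M0/M1 naming are ours, not the paper's:

```python
import numpy as np

def score(query_emb, doc_embs):
    # Dot-product relevance between one query and all indexed documents.
    return doc_embs @ query_emb

# Session 0: embeddings computed by model M0 and already stored in the index.
old_index = np.array([[0.1, 0.9],
                      [0.8, 0.2]])          # D_0, embedded by M0

# Session 1: if M1 is backward-compatible, only the new documents D_1 are
# embedded with M1; the old rows are reused as-is instead of being recomputed.
new_docs_emb = np.array([[0.5, 0.5]])       # D_1, embedded by M1
index = np.vstack([old_index, new_docs_emb])

q = np.array([1.0, 0.0])                    # query embedded by M1
ranking = np.argsort(-score(q, index))      # retrieve by descending score
```

Without compatibility, `old_index` would have to be re-encoded with M1 before the `vstack`, which is exactly the cost the task description aims to avoid.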

METHODOLOGY 4.1 Overview of the Approach
We employ the typical memory mechanism [5,38] in L2R and maintain a restricted external memory to store a subset of historical documents for each training query. This memory mechanism enables the model to efficiently determine the replay samples for addressing the catastrophic forgetting issue, without browsing the entire collection. Based on this memory mechanism, as shown in Algorithm 1, when the newly emerged data D_t arrives at session t, L2R selects diverse support negative documents from the new data and the memory buffer B_{t-1} respectively, then updates the retriever from M_{t-1} to M_t with the selected samples, and finally updates the memory with the new data. Next, we introduce the detailed data selection method (Section 4.2) and the optimization objective for compatible learning (Section 4.3).

Diverse Support Negative Selection
To effectively adapt the retriever to the new distribution, we aim to select diverse support negatives for model training. To this end, we define the positive sample superiority (PSS) and inter sample diversity (ISD) criteria to guide the data selection in each step.
Let q and d denote the embeddings of a query q and a document d, and let d_∥ and d_⊥ denote the projections of d in the directions parallel and orthogonal to q. Intuitively, d_∥ and d_⊥ represent the information in d that is related and unrelated to q, respectively.

Definition 1 (Positive Sample Superiority). The positive sample superiority between d and d_q^+ for the query q is given by

PSS(d, d_q^+) = sgn(·) · ‖d_{q,∥}^+ − d_∥‖_2,

where sgn(·) = 1 if d_{q,∥}^+ − d_∥ and d_{q,∥}^+ point in the same direction and −1 otherwise, ‖·‖_2 is the ℓ2 norm, d can be any document, and d_q^+ is a relevant document for q. The parallel projection d_∥ is defined as

d_∥ = ((d · q) / ‖q‖_2^2) q.

The PSS measures the superiority of d_q^+ in being more relevant to the query q than d, by comparing the differences between their query-related information. Therefore, a higher PSS value suggests that d is less likely to be an unlabeled relevant sample for q.

Definition 2 (Inter Sample Diversity). For a given query q, the inter sample diversity between d and a document set S is

ISD(d, S) = min_{d' ∈ S} ‖d_⊥ − d'_⊥‖_2,

where d_⊥ = d − d_∥. The ISD measures the diversity of document d relative to the document set S, by comparing the differences between the query-unrelated information among the documents.

Based on the two defined criteria, we introduce each step of the model learning, taking (q, d_q^+) ∼ C_0 in session t as an example.

STEP 1: New Data Selection. For using the new data to adapt to the current session t, we have the following principles: (1) Documents that are likely to be unlabeled positives should be avoided during selection, since mistakenly using relevant documents as negatives for model learning could cause serious damage to the performance.
(2) The selected documents should be negatives that can support the model in learning the decision boundary (we refer to them as support negatives); such documents should not be trivial for the model to differentiate. (3) The selected documents should have minimum redundancy. With these principles, we propose the following selection strategy for the new data.
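The two criteria of Definitions 1 and 2 can be rendered in a few lines of NumPy. This is a minimal sketch under our reading of the definitions; in particular, aggregating ISD over the document set with a minimum is our assumption:

```python
import numpy as np

def proj_parallel(d, q):
    # Projection of document embedding d onto the query direction q.
    return (d @ q) / (q @ q) * q

def pss(d_pos, d, q):
    # Positive Sample Superiority (Definition 1): signed l2 distance between
    # the query-parallel components of the labeled positive and a candidate.
    diff = proj_parallel(d_pos, q) - proj_parallel(d, q)
    sign = 1.0 if diff @ proj_parallel(d_pos, q) >= 0 else -1.0
    return sign * np.linalg.norm(diff)

def isd(d, selected, q):
    # Inter Sample Diversity (Definition 2): distance between the
    # query-orthogonal component of d and those of already-selected docs.
    # The min-aggregation over the set is our assumption.
    d_perp = d - proj_parallel(d, q)
    if not selected:
        return float("inf")
    return min(np.linalg.norm(d_perp - (s - proj_parallel(s, q)))
               for s in selected)
```

A candidate with high PSS is unlikely to be an unlabeled positive, and one with high ISD adds information the already-selected negatives lack.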
We first retrieve the top results for q from the new-coming document collection D_t with BM25 to filter out massive non-informative samples, obtaining its potential support samples S_q^t. Then, based on the defined PSS and ISD criteria, we adaptively select n_1 diverse support negatives from S_q^t with

d* = argmax_{d ∈ S_q^t} [ PSS(d, d_q^+) + γ · ISD(d, D_q^new) ],   (5)

where the embeddings used to calculate PSS and ISD are obtained from the latest model M_{t-1}. The PSS component helps to bypass unlabeled relevant documents, and the ISD component prefers samples that are distinct from the majority; we use a hyper-parameter γ to reconcile the two measures. With Eq. (5), we select n_1 new documents D_q^new from D_t that satisfy the aforementioned three principles. These selected samples are reserved for model updating and are also added to the temporary memory B̃_t as candidates for updating the memory B_{t-1}.

STEP 2: Memory Data Selection. To prevent the model from forgetting old knowledge when learning from the new data D_t, we also select replay samples for model training from B_{t-1} that are: (1) pivotal for learning the historical versions of the model, i.e., historical support samples; (2) not redundant with each other, for efficiency concerns; (3) different from the selected samples in D_q^new, to better balance the acquired knowledge and the new knowledge.
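The adaptive selection of Eq. (5) can be sketched as a greedy loop: each pick maximizes PSS plus γ times the ISD with respect to the already-selected set. Re-scoring after every pick is our interpretation of "adaptively select"; the scoring callables are placeholders:

```python
def select_diverse_negatives(candidates, pss_scores, isd_fn, n, gamma):
    """Greedily pick n negatives, each maximizing PSS + gamma * ISD with
    respect to the already-selected set (our reading of Eq. (5))."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < n:
        best = max(
            remaining,
            key=lambda i: pss_scores[i]
            + gamma * isd_fn(candidates[i],
                             [candidates[j] for j in selected]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With γ = 0, this degenerates to plain hard-negative selection by PSS; the ISD term is what spreads the picks across the query-unrelated subspace.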
With the memory updating strategy in STEP 4, the samples in B_{t-1} already satisfy the first two desiderata. To filter with the third principle, we select n_2 documents D_q^mem from B_{t-1} that have the maximum ISD score with respect to D_q^new:

D_q^mem = argmax_{d ∈ B_q^{t-1}} ISD(d, D_q^new),   (6)

where B_q^{t-1} denotes the stored old documents for q in the memory buffer B_{t-1}. Note that, for computing the ISD score, the embeddings of memory samples are the existing ones computed in previous sessions when the learning is backward-compatible; otherwise, the embeddings are obtained using the latest model M_{t-1}. In the rest of this paper, we adopt the same approach to compute embeddings, and we omit this reminder unless there are special circumstances.

STEP 3: Model Update. With the selected new documents D_q^new (from STEP 1) and memory documents D_q^mem (from STEP 2), we can update a standard retrieval model from M_{t-1} to M_t. Without loss of generality, the retrieval model M_t can be formalized as

s(q, d) = E_Q(q) · E_D(d),   (7)

where E_Q and E_D are the query and document encoders, and the dot product calculates the relevance score based on their embeddings. For model training, we use the standard contrastive learning objective [15,45] to compute the loss for the positive document d_q^+ (no compatibility is ensured):

L_cl = −log [ exp(s(q, d_q^+)) / ( exp(s(q, d_q^+)) + Σ_{d ∈ D_q^new ∪ D_q^mem} exp(s(q, d)) ) ].   (8)

When the model updating is not backward-compatible for document representations, we need to re-embed all the documents up to the t-th session, i.e., D_{0:t}, with M_t to rebuild the retrieval index. To eliminate the need for re-inferring embeddings of old documents, we can replace the learning objective in Eq. (8) with the backward-compatible learning objective in Section 4.3.

STEP 4: Memory Update. In practice, the memory buffer size is often limited to keep the selection of replay samples efficient, even though the buffer does not impose a heavy storage burden. Given the limited budget m for each query, selecting which samples to include or replace in the memory is critical. We consider two principles to populate the memory: (1) a sample should have a strong impact on learning the decision boundary; (2) the redundancy between stored samples should be minimized. In contrast to most work that updates the memory at each training step [1,5], we delay the memory update until after the completion of model updating in each session, in order not to occupy the limited slots in the buffer.
To preserve important information of the current session t for the future, we follow the first principle and consider only the support samples in B̃_t as candidates to update B_{t-1}. We calculate the ISD score of documents in B_{t-1} and B̃_t with respect to randomly sampled anchor documents in B_{t-1}, and use the new documents with the maximum diversity to replace the n_3 memory samples with the minimum diversity. Finally, we empty the temporary memory buffer to prepare for the next session. Note that, for the initial session (t = 0), we use reservoir sampling [5] to fill the memory.
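STEP 4's delayed replacement policy can be sketched as follows. The diversity scorer is passed in as a callable (standing in for the anchor-based ISD score), and the exact ranking and tie-breaking details are our assumptions:

```python
def update_memory(memory, candidates, div_score, n_replace):
    """Sketch of STEP 4: swap the n_replace lowest-diversity memory slots
    for the n_replace highest-diversity new candidates.
    div_score is a placeholder for the anchor-based ISD score."""
    mem_order = sorted(range(len(memory)),
                       key=lambda i: div_score(memory[i]))        # ascending
    cand_order = sorted(range(len(candidates)),
                        key=lambda i: -div_score(candidates[i]))  # descending
    updated = list(memory)
    for slot, c in zip(mem_order[:n_replace], cand_order[:n_replace]):
        updated[slot] = candidates[c]
    return updated
```

Because the update runs once per session rather than per training step, the buffer slots stay stable while the model is being trained.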

Backward-compatible Learning
To save the cost of repeated embedding computation, it is desirable for the model updating to ensure the backward compatibility of document representations. That is, the existing embeddings for D_{0:t-1} do not need to be updated, and only the embeddings of the new documents in D_t are computed with the latest model M_t to update the index. We first introduce a vanilla method that can ensure backward compatibility, and then two auxiliary alignment objectives for effective compatible learning.

Vanilla Compatible Learning. A straightforward approach is to optimize a new contrastive learning loss while fixing the embeddings of previous documents (i.e., the positive sample and the memory samples selected in the current training):

L_vanilla = −log [ exp(s̄(q, d_q^+)) / Z ],   (9)

where s̄(q, d) uses the stored embedding of a previous document d, and Z is a normalization term:

Z = exp(s̄(q, d_q^+)) + Σ_{d ∈ D_q^mem} exp(s̄(q, d)) + Σ_{d ∈ D_q^new} exp(s(q, d)).

Eq. (9) optimizes the model on the new data and the existing document embeddings in a unified space to ensure compatibility. However, since all the new samples in D_q^new are negatives and only the embeddings of new samples are learnable, the model could easily learn a spurious correlation between a document being in the new distribution and it being irrelevant, leading to significant performance regression (see the experimental results in Section 6). To facilitate effective backward-compatible representation learning, we introduce two auxiliary alignment objectives.

Embedding-aligned Learning. As in [37], a common approach to ensure backward-compatible model updating is to minimize the ℓ2 distance between the embeddings of previous documents (i.e., {d_q^+} ∪ D_q^mem) calculated with the new model M_t and their existing embeddings:

L_emb = Σ_{d ∈ {d_q^+} ∪ D_q^mem} ‖E_D^t(d) − d_old‖_2.

By guiding the model to encode the old documents similarly to their existing embeddings, it urges the model to learn decent document representations instead of blindly demoting new documents. However, this pointwise alignment is too strict for the model to adapt to the new documents sufficiently.
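The embedding-aligned objective amounts to a pointwise penalty between the new model's embeddings of old documents and their stored copies. A minimal sketch, assuming distances are summed over the aligned documents:

```python
import numpy as np

def embedding_alignment_loss(new_embs, stored_embs):
    # Pointwise alignment: sum of l2 distances between the new model's
    # embeddings of old documents and their stored (old-model) embeddings.
    # Summing (rather than averaging) over documents is our assumption.
    new_embs = np.asarray(new_embs, dtype=float)
    stored_embs = np.asarray(stored_embs, dtype=float)
    return float(np.sum(np.linalg.norm(new_embs - stored_embs, axis=1)))
```

Driving this term to zero pins the old documents' embeddings in place, which is exactly the rigidity the ranking-aligned variant below is designed to relax.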
Ranking-aligned Learning. To relax the constraint on the model for learning new knowledge, we propose a looser listwise alignment objective. The goal is to minimize the divergence between the predicted distributions over the candidate documents calculated from the existing and the currently learned embeddings, i.e., P(d|q) and P'(d|q) respectively:

L_rank = KL( P(d|q) ‖ P'(d|q) ),

where D = {d_q^+} ∪ D_q^mem ∪ D_q^new, and

P(d|q) = exp(s̄(q, d)) / Σ_{d' ∈ D} exp(s̄(q, d')),   P'(d|q) = exp(s(q, d)) / Σ_{d' ∈ D} exp(s(q, d')).

The probability distribution P(d|q) represents the model inference when backward compatibility is enabled, and P'(d|q) represents the model predictions without compatible learning, where all the embeddings need to be learned. In contrast to the pointwise embedding alignment, this ranking-based alignment not only allows more flexible exploration of the representation space but also facilitates bidirectional supervision for model learning. The overall backward-compatible objective combines the vanilla loss with an alignment term:

L = L_vanilla + λ L_align,   (15)

where L_align is the embedding- or ranking-based alignment objective, and λ is a hyper-parameter to control the effect of the alignment regularization.
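The listwise alignment can be sketched as a KL divergence between two softmax distributions over the same candidate list, one induced by the stored embeddings and one by the freshly learned ones. The KL direction and the plain dot-product logits are our assumptions:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ranking_alignment_loss(q, stored_doc_embs, learned_doc_embs):
    """Listwise alignment sketch: KL between the ranking distribution
    induced by stored (old-model) embeddings, P(d|q), and the one induced
    by the new model's embeddings, P'(d|q)."""
    p = softmax(np.asarray(stored_doc_embs, dtype=float) @ q)    # P(d|q)
    p_new = softmax(np.asarray(learned_doc_embs, dtype=float) @ q)  # P'(d|q)
    return float(np.sum(p * np.log(p / p_new)))
```

The loss vanishes whenever the two embedding sets induce the same ranking distribution, so the new model is free to move document embeddings as long as the relative scores are preserved.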

EXPERIMENTAL SETTINGS 5.1 Benchmark Construction
There are no publicly available datasets that reflect the continuous growth of documents in realistic retrieval scenarios, potentially with distribution drift, booming events, and newly emerged relevant documents for previous queries. Thus, we build two benchmarks, LL-LoTTE and LL-MultiCPR, based on the two retrieval datasets LoTTE [36] and Multi-CPR [22], to simulate a scenario with the aforementioned properties through the following steps:

Preprocessing. LoTTE and Multi-CPR consist of 5 and 3 domains with separate subsets of documents and queries, respectively. For each domain of the two datasets, we merge all the data and re-split them into train/dev/test sets with a ratio of 0.7:0.15:0.15 for LoTTE and 0.9:0.05:0.05 for Multi-CPR. Table 1 lists the statistics of the two datasets.
Session Partitioning. We build an initial collection D_0 and 3 upcoming sessions with different document distributions for both LL-LoTTE and LL-MultiCPR. In LL-LoTTE, we use technology and writing as the common domains, whose documents emerge evenly over time, and lifestyle, recreation, and science as the booming domains in each of the upcoming sessions. We keep 70% and 40% of the documents from the common and booming domains, respectively, in the initial collection, so that the initial session can have more annotated relevant documents for evaluation. In LL-MultiCPR, similar to LL-LoTTE, we choose e-commerce as the common domain, and medical and entertainment as the booming domains for Sessions 1 and 2, respectively. Since there are only three domains, Session 3 has no booming domain and simply includes the remaining documents.
Postprocessing. In LoTTE, almost all relevant documents of each query have positive labels. This makes it hard to simulate the realistic scenario where quite a few documents relevant to the training queries may appear in the upcoming sessions without being labeled. To overcome this issue, we collect extra pseudo-relevant documents for the training queries using the OpenAI API (text-davinci-003), and distribute these unlabeled documents to each coming session with the same sampling ratios as in session partitioning. Specifically, we use two types of instructions for pseudo-relevant document generation: (1) "Given a question {q} and a relevant document {d_q^+}, please generate 5 other relevant documents."; (2) "Given a document {d_q^+}, please rephrase it.". Through this process, we obtain approximately 18.5 documents for each training query. For LL-MultiCPR, we do not conduct post-processing since there are sizable unlabeled relevant documents in Multi-CPR (see [22]).
Table 2 lists the statistics of the final LL-LoTTE and LL-MultiCPR datasets. Following similar steps, other existing retrieval datasets can also be transformed to evaluate lifelong learning for first-stage retrieval. When there are no explicitly separated domains, topic clustering could be applied for simulation; we leave such investigation for future research.

Evaluation Metrics
We define metrics to evaluate lifelong learning methods for first-stage retrieval. Considering the realistic scenario, for each session we care most about the retrieval performance over the document collection D_{0:t} after the learning of session t; the per-session performance P_{t,i} can be measured using any common retrieval metric like Recall or MRR. We take the performance at session t, namely P_t, and the average performance over all coming sessions, namely AP, to compare the various methods. Following [25], we also apply auxiliary metrics to assess how fast a model learns (Training Time), how much the model forgets (Forget_t), and how well the model transfers knowledge from one session to future sessions (FWT). To instantiate the above metrics in our work, we follow the evaluation methods of the original LoTTE [36] and Multi-CPR [22]. Besides Recall (R@k), Success (S@k) and Mean Reciprocal Rank (MRR@k) are used for LoTTE and Multi-CPR, respectively. Following the official cutoffs for k, we report the lifelong learning performance on the above-defined metrics with S@5 and R@100 for LL-LoTTE, and MRR@10 and R@1000 for LL-MultiCPR. (We performed a quantitative analysis of the generated pseudo documents in Section 5.1 to ensure that they are of high quality: 63% of them can be retrieved in the top-200 BM25 results for the training queries in each upcoming session.)
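The session-wise metrics can be computed from a performance matrix P[t][i], the score on session-i evaluation data after finishing session t. This is a sketch following common continual-learning practice [25]; the matrix convention and the averaging choices are our assumptions:

```python
import numpy as np

def avg_performance(P, t):
    # Mean performance over all sessions seen so far, measured after
    # finishing session t (P[t][i] = score on session i's evaluation data).
    return float(np.mean(P[t][: t + 1]))

def forgetting(P, t):
    # For each earlier session i, the drop from the best score ever
    # achieved on i to the score after session t; averaged over sessions.
    drops = [max(P[j][i] for j in range(i, t + 1)) - P[t][i]
             for i in range(t)]
    return float(np.mean(drops)) if drops else 0.0
```

Any retrieval metric (S@5, R@100, MRR@10, ...) can populate P; the helpers are agnostic to which one is used.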

Baselines
We consider two types of baselines for comparison. Memory-based methods: (1) ER [5] applies random sampling for memory data selection and reservoir sampling for memory update; despite its simplicity, ER outperforms many complex lifelong learning methods [25]. (2) MIR [1] chooses replay samples according to their loss increment with respect to the updated model learned on the new data, and also uses reservoir sampling for memory update. (3) GSS [2] has the same memory data selection strategy as ER but refines the memory update by diversifying the samples in the memory buffer based on their gradients; however, it incurs huge computation costs. (4) OCS [46] is one of the latest methods for lifelong learning with noisy data; it selects samples with high affinity to previous data based on their gradients for model and memory update. Naive methods: (1) Initial conducts no model updating and uses the model trained in the initial session for retrieval in the upcoming sessions. (2) Incre-train initializes the model from the previous session and updates it with the new data in the current session. (3) Retrain trains the model from scratch in each session using all the data available up to that session.
To see the separate effects of our proposed data selection strategy and ranking alignment objective, we compare our method with the baselines both without and with backward-compatible representation learning (based on Eq. (8) or Eq. (15)). Note that: (1) Initial has the same performance under the two settings since the model is not updated; (2) Retrain works only without backward compatibility since the model is retrained from scratch in each session. For the comparisons under the backward-compatibility constraint, we equip the baselines with the embedding alignment objective in Eq. (15), and our model L2R uses none or one of the two alignment objectives, denoted L2R_vanilla, L2R_emb, and L2R_rank, respectively.

Implementation Details
We implement the retrieval model with DPR [15], and the parameters are initialized with BERT-base released by Google. The hyper-parameters of the baselines and our method are tuned on the dev set. For LL-LoTTE, we truncate the input query and passage to a maximum of 32 and 256 tokens respectively. We train retrieval models with the BM25 top-500 results for the initial session and top-200 results for the upcoming sessions, and the key hyper-parameters of BM25 are tuned to k1 = 0.80 and b = 0.72. We use a batch size of 96, and a learning rate of 5e-6 and 1e-6 for the initial session and upcoming sessions respectively. For LL-MultiCPR, we set the query and passage lengths to 32 and 128 respectively. We train the initial session and upcoming sessions with the BM25 top-500 results, with k1 = 0.20 and b = 0.72. We use a learning rate of 1e-5 and 3e-6 for the initial and upcoming sessions respectively, and a batch size of 192. For the two datasets, we pair each positive document with 5 negatives for training, including 3 new documents and 2 memory documents. For data selection, we sample a subset with twice the desired number of documents in each training step, instead of using the entire collection, to save computation cost. For memory update, we set the numbers of anchor documents and replaced documents to 1/3 of the memory buffer size m. We set γ to 0.6 and 0.8 for LL-LoTTE and LL-MultiCPR respectively, and λ to 1.0 and 3.0 respectively. For each dataset, we set the buffer size m of each training query under two settings that can hold: (1) half of the training negatives in the initial session (i.e., 30 for LL-LoTTE and 10 for LL-MultiCPR); (2) all training negatives across all the sessions (i.e., 100 for LL-LoTTE and 30 for LL-MultiCPR). We use the former as the default setting.
We adopt the Transformers for implementations and all experiments run on Nvidia Tesla V100-32GB GPUs.Statistically significant differences are measured by a two-tailed t-test.The datasets and code are available at https://github.com/caiyinqiong/L-2R.

RESULTS AND DISCUSSION
In this section, we present the experimental results and conduct a thorough analysis of L2R to clarify its advantages.

Main Evaluation
We compare L2R with all the baselines in Section 5.3, and record their results under the two settings in Table 3 and Table 4 respectively.

Performance without Representation Compatibility. From Table 3, we find that: (1) Without special measures for lifelong learning, the neural retriever DPR (i.e., Initial) shows poorer generalization ability than the term-based retriever BM25, especially when the distribution changes violently. For example, DPR outperforms BM25 on LL-LoTTE in Sessions 0-2 but underperforms it when massive science documents flood in during Session 3 (there are significantly more documents in the science domain than in the others). This observation is consistent with the conclusion in [33] that neural retrievers are less robust than BM25. (2) Among the methods that learn from new data (i.e., all methods except Initial), Incre-train performs worse than the memory-based methods, probably because it does nothing to address the catastrophic forgetting issue. Additionally, Incre-train is not always superior to Initial, particularly on LL-MultiCPR. Apart from the forgetting issue, we believe a potential reason is that the sizable unlabeled relevant documents in the new data can hurt model updating. (3) It is worth noting that Retrain exhibits worse performance, particularly on recall, which deviates from findings in other lifelong learning tasks like image classification [1]. This is probably because the retrained retriever has seen fewer varieties of negative samples and has a higher probability of using emerged unlabeled positive documents for training. (4) Among the memory-based methods, MIR has the overall best performance on both datasets and OCS cannot exceed it. This shows that the gradient-based method for filtering out noisy data in OCS does not work effectively for unlabeled relevant documents in the retrieval task. (5) On both benchmarks, L2R consistently outperforms the baselines in all upcoming sessions. Especially in Session 3 of LL-LoTTE, which has violent distribution drift, L2R beats the others by a large margin and surpasses BM25. These performance gains confirm the advantages of our proposed data selection strategy in L2R.
Performance with Representation Compatibility. The performance of all the methods with representation compatibility is presented in Table 4. Compared to Table 3, we have the following observations. From the perspective of effectiveness: (1) Adding the embedding alignment to ensure representation compatibility leads to significantly lower model performance, even worse than Initial. This shows that enforced embedding alignment can hurt model learning on new data. Among these methods, Incre-train is hurt the least, probably because the regularization is applied to fewer documents. (2) Among the three variants of L2R with representation compatibility, L2R-vanilla suffers from model collapse when optimized only with the contrastive learning loss on the existing embeddings of previous documents. By injecting an alignment regularization, the model can be updated more effectively with backward-compatible representations. (3) Encouragingly, L2R-rank significantly exceeds L2R-emb, and even outperforms L2R without representation compatibility in some sessions (e.g., Session 1 of LL-LoTTE and almost all sessions of LL-MultiCPR). This shows that alignment on predicted ranking lists allows more flexible encoder updates than direct embedding alignment, and that the predictions based on existing embeddings (computed by old models) provide beneficial information from previously acquired knowledge to guide the model in learning new data (see further analysis in Section 6.3). From the perspective of efficiency: (1) With representation backward-compatibility, 79% (2.73M vs. 13.16M) and 81% (1.47M vs. 7.85M) of the computation cost for inferring document representations can be saved compared to the setting without compatibility on LL-LoTTE and LL-MultiCPR, respectively (accumulated over the 3 upcoming sessions). Overall, these results demonstrate that the ranking alignment objective in L2R promotes both the effectiveness and efficiency of lifelong model learning.

Performance with Larger Memory Buffer Size. To investigate the impact of memory buffer size on model performance, we conduct experiments using a larger buffer that can hold all the training samples used in the four sessions. Only the results of the last session using different buffer sizes are reported in Table 5 for a clear comparison. We observe that with a larger buffer size, the performance of L2R-rank is further improved, particularly on precision metrics such as S@5 and MRR@10 (e.g., the improvement is 1.3% on S@5 for LL-LoTTE and 4.0% on MRR@10 for LL-MultiCPR). However, the baseline methods do not benefit as much from a larger memory, probably because ER and MIR store randomly sampled documents and, more importantly, none of them can filter out the unlabeled positives. In contrast, L2R stores diverse support negative samples, thereby making more efficient use of the memory buffer slots.
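The efficiency gain above comes from reusing the stored embeddings of old documents and encoding only newly arrived ones. A minimal sketch of such a backward-compatible index update follows; the function names, identifiers, and toy embedding function are illustrative, not the paper's implementation:

```python
import numpy as np

def update_index(index, doc_ids, new_docs, encode_fn):
    """Backward-compatible index update: embeddings of previously indexed
    documents are reused as-is; only documents not yet in the index are
    encoded with the updated (compatible) encoder."""
    for doc_id, text in new_docs.items():
        if doc_id not in index:          # reuse old embeddings, encode only new docs
            index[doc_id] = encode_fn(text)
            doc_ids.append(doc_id)
    return index, doc_ids

# Toy "encoder": hashes characters into a normalized 4-dim vector.
def toy_encode(text):
    v = np.zeros(4)
    for ch in text:
        v[ord(ch) % 4] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

index, ids = {}, []
update_index(index, ids, {"d1": "old doc"}, toy_encode)                    # session t-1
update_index(index, ids, {"d1": "old doc", "d2": "new doc"}, toy_encode)   # session t
```

Without compatibility, every session would have to re-encode the entire collection; here only the delta is encoded, which is where the reported 79%/81% savings come from.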

Studies on Data Selection Strategy
We run ablation studies on the data selection strategy to investigate its impact on model learning.
For data selection, we define two criteria that measure, respectively, the likelihood of a document being a negative and its diversity relative to others. We compare L2R-rank with several ablation variants in Table 6 to verify the effectiveness of these criteria: (1) For the new-data selection module, we observe that without the component that filters out unlabeled relevant documents in the new data, the retrieval performance on the two datasets decreases significantly. Removing the diversity component also causes a performance drop, especially on recall, since the retriever sees fewer varieties of negative samples if redundancy among the selected samples is not considered. These results demonstrate that both criteria are important for selecting new data that helps the model adapt to new distributions. (2) For the memory replay module, we remove the selection component and randomly sample replay examples from the memory. The performance decreases on both datasets, showing that selecting samples different from the new data for replaying is critical for effective model updating. A probable reason is that the co-occurrence of discrepant, or even conflicting, data encourages the model to balance learning new knowledge against preserving old knowledge. (3) For the memory updating module, we remove the selection component and replace the samples in the memory randomly. The results show that LL-LoTTE regresses less than LL-MultiCPR, probably because LL-MultiCPR uses a smaller buffer size (10), so storing non-redundant samples is more important for it to address the forgetting issue.
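As a rough illustration of how the two criteria can interact, the sketch below selects support negatives greedily, trading off a negativity score against redundancy with already-selected candidates in an MMR-like fashion. The scoring form and the `lam` trade-off weight are placeholder assumptions, not the exact criteria defined in the paper:

```python
import numpy as np

def select_support_negatives(cand_embs, neg_scores, k, lam=0.5):
    """Greedily pick k candidates balancing:
      (a) neg_scores[i]: likelihood that candidate i is a true negative
          (higher = more likely negative, filtering unlabeled positives), and
      (b) diversity: penalize max cosine similarity to already-selected picks.
    A sketch of the selection idea, not the paper's exact formulation."""
    cand_embs = cand_embs / (np.linalg.norm(cand_embs, axis=1, keepdims=True) + 1e-9)
    selected, remaining = [], list(range(len(neg_scores)))
    while remaining and len(selected) < k:
        best, best_val = None, -np.inf
        for i in remaining:
            redundancy = max((float(cand_embs[i] @ cand_embs[j]) for j in selected),
                             default=0.0)
            val = lam * neg_scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate candidates, the redundancy penalty makes the selector skip the second copy in favor of a more diverse, slightly lower-scored candidate, which mirrors the recall benefit the ablation attributes to the diversity criterion.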

Studies on Alignment Objectives
We conduct studies on the embedding and ranking alignment objectives to probe their impact on model updating.
Performance on Seen and Unseen Queries. To understand how the alignment objectives affect model updating, we split the test set of each upcoming session in LL-LoTTE into previously seen queries and newly unseen queries, and evaluate the performance of L2R, L2R-emb, and L2R-rank. From Table 7, we find that: (1) The seen queries generally achieve higher S@5 but lower R@100. This is because seen queries usually have more relevant documents than unseen queries, which works against them on recall.
(2) With L2R-emb, both seen and unseen queries suffer a significant performance drop compared to L2R without compatibility. In particular, the drop on unseen queries is more dramatic than that on seen queries, showing that direct embedding alignment constrains the model from learning new knowledge. (3) Interestingly, L2R-rank demonstrates improved performance on unseen queries compared to L2R, showing that the ranking results predicted on the old embeddings provide beneficial supervision for the model to learn relevance matching on new data. Moreover, the ranking alignment does not harm seen queries, unless the distribution changes drastically and the model compromises to fit the new data (i.e., Session 3).
Performance on Auxiliary Metrics. We compare all the methods with representation compatibility on auxiliary metrics, including Forget-T, FWT, and Training Time, to gain insights into the model updating process. From Figure 2, we observe: (1) Among all the methods, L2R-rank performs best in addressing the catastrophic forgetting issue. In particular, on LL-LoTTE it achieves negative values of Forget-T. Apart from the superior memory mechanism in L2R, one possible reason is that, during the lifelong learning process, models with ranking-aligned compatible learning can effectively acquire new knowledge, and the query encoder is adjusted to better
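For reference, a forgetting metric over the session-wise performance matrix can be sketched as below. This uses a common definition from the continual-learning literature (average drop from the best past performance on each earlier session); the paper's exact Forget-T may differ in detail:

```python
import numpy as np

def forgetting(perf):
    """perf[i][j]: performance on session-j test queries after training
    through session i (only entries with i >= j are meaningful).
    Forgetting after the final session T averages, over earlier sessions j,
    the drop from the best performance ever achieved on j to the final
    performance on j. Negative values indicate no forgetting (the model
    got better on old sessions), as observed for L2R-rank on LL-LoTTE."""
    perf = np.asarray(perf, dtype=float)
    T = perf.shape[0] - 1
    drops = [perf[:T, j].max() - perf[T, j] for j in range(T)]
    return float(np.mean(drops))
```

A matrix where final-session performance on old test sets exceeds all earlier values yields a negative score, which is the "negative Forget-T" regime discussed above.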

CONCLUSION AND FUTURE WORK
In this work, we study a common scenario in real-world search engines, where numerous documents continuously emerge with potential distribution drift. To adapt the retriever to new distributions, we propose a memory-based lifelong learning method for first-stage retrieval, namely L2R. By employing a selection strategy of diverse support negatives for model updating, along with a ranking alignment objective for backward-compatible representation learning, L2R can continually train the retriever on unlabeled emerging documents both effectively and efficiently. Extensive experiments on our constructed benchmarks demonstrate the superiority of L2R over competitive lifelong learning baselines.
Our work presents an initial step towards solving the critical challenges in lifelong learning for first-stage retrieval. Due to page limitations, certain promising directions remain unexplored in this study. Firstly, it is worth investigating whether methods proposed for domain adaptation still work well in the lifelong learning setting, as both address distribution changes. Secondly, the current method does not yet include specialized techniques for handling queries related to booming topics, which presents an avenue for future research. In conclusion, we believe that our study, despite its limited scope, provides valuable and generalizable insights that can guide future research on this task.

Figure 1: Memory-based lifelong learning method for first-stage retrieval (L2R).

Figure 2: Evaluation on auxiliary metrics. Each column denotes a metric and each row denotes a dataset.

Algorithm 1: Overview of L2R.
1) P′(d|q) can adapt the model better to the new data, since the embeddings of all candidates, including the new ones, are freshly learned. It can thus guide P(d|q), the backward-compatible inference we finally use, to better acquire new knowledge. 2) In P(d|q), since the positive document and the memory negative samples are ranked based on their existing embeddings up until session t−1, P(d|q) captures their relative rankings from the model at session t−1. This older model has seen the negatives from sessions 1 to t−1, including those already removed from the memory, and thus covers various types of negatives. Hence, by aligning with P(d|q), P′(d|q) can learn from the older, more experienced model. Given this mutual supervision between the new and old models, our proposed ranking alignment objective ensures representation compatibility without compromising model performance, and even obtains better results (see more analysis in Section 6.3).
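To make the two ranking distributions concrete, the sketch below scores a query's candidates against both the existing (old) and re-encoded (new) document embeddings, forms softmax ranking distributions, and aligns them with a KL term. The temperature, softmax form, and KL direction are illustrative assumptions rather than the paper's exact objective:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)          # numerical stability
    e = np.exp(x)
    return e / e.sum()

def ranking_alignment_loss(q_new, cand_old, cand_new, tau=1.0):
    """Sketch of a ranking alignment objective (not the paper's exact loss).
      P(d|q)  ~ softmax(q_new . cand_old / tau)  : backward-compatible path,
                ranking candidates by their EXISTING (session t-1) embeddings.
      P'(d|q) ~ softmax(q_new . cand_new / tau)  : fully updated path over
                freshly re-encoded candidates.
    Minimizing KL(P || P') lets P' learn from the older, more experienced
    ranking while keeping the new query encoder compatible with old vectors."""
    p = softmax(q_new @ cand_old.T / tau)        # distribution over old embeddings
    p_prime = softmax(q_new @ cand_new.T / tau)  # distribution over new embeddings
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(p_prime + 1e-12))))
```

The loss vanishes when both embedding sets induce the same ranking distribution and grows as the rankings diverge, which is the mutual-supervision behavior described above.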

Table 1: Statistical information of LoTTE and Multi-CPR.
Random documents are sampled from the common and booming domains respectively in D_0. Next, we construct 3 corpora {D_1, D_2, D_3} for the following three sessions. Each corpus consists of 10%, 50%, and 5% of the documents from the two common domains, a booming domain, and the remaining two domains, respectively. With the documents in each session, we collect their connected queries from the newly split train/dev/test sets of LoTTE to construct the training dataset and the dev/test sets. Note that, under the setting where new-coming documents have no labels, the labeled relevant query-document pairs for model training remain those of C_0, but the dev/test sets Q_dev and Q_test

Table 2: Statistics of LL-LoTTE and LL-MultiCPR datasets.
Let each entry denote the retrieval performance evaluated on the test queries of session j (i.e., Q_test

Table 5: Evaluation results of the last session (Session 3) with different buffer sizes on LL-LoTTE and LL-MultiCPR. All the methods run with representation compatibility.

Table 6: Ablation studies on the data selection strategy in L2R-rank. Evaluation results of the last session (Session 3) on LL-LoTTE and LL-MultiCPR are reported.

Table 7: Investigations on the alignment objectives. Evaluation results of each session (Sessions 1-3) in LL-LoTTE are reported.