Continual Learning for Generative Retrieval over Dynamic Corpora

Generative retrieval (GR) directly predicts the identifiers of relevant documents (i.e., docids) based on a parametric model. It has achieved solid performance on many ad-hoc retrieval tasks. So far, these tasks have assumed a static document collection. In many practical scenarios, however, document collections are dynamic, where new documents are continuously added to the corpus. The ability to incrementally index new documents while preserving the ability to answer queries with both previously and newly indexed relevant documents is vital to applying GR models. In this paper, we address this practical continual learning problem for GR. We put forward a novel Continual-LEarner for generatiVE Retrieval (CLEVER) model and make two major contributions to continual learning for GR: (i) To encode new documents into docids with low computational cost, we present Incremental Product Quantization, which updates a partial quantization codebook according to two adaptive thresholds; and (ii) To memorize new documents for querying without forgetting previous knowledge, we propose a memory-augmented learning mechanism, to form meaningful connections between old and new documents. Empirical results demonstrate the effectiveness and efficiency of the proposed model.


INTRODUCTION
Generative retrieval (GR) has emerged as a new paradigm for information retrieval (IR) [34]. Without loss of generality, the GR paradigm aims to integrate all necessary relevant information in the collection into a single, consolidated model. With GR, indexing is replaced by model training, while retrieval is replaced by model inference. A sequence-to-sequence (seq2seq) model is jointly trained for both indexing and retrieval tasks: the indexing task aims to associate the document content with its identifiers (i.e., docids); the retrieval task requires that queries are mapped to relevant docids.
GR and dynamic corpora. Most existing work on GR assumes a stationary learning scenario, i.e., the document collection is fixed [6,12,46,47]. However, dynamic corpora are a common setting for IR. In most real-world scenarios, information changes and new documents emerge incrementally over time. For example, in digital libraries, new electronic collections are continuously added to the system [48]. And a medical search engine may continuously expand its coverage to provide information about emerging diseases, as we have seen with COVID-19 [24]. An important difference between static and dynamic scenarios is that in the former a GR system may be provided with abundant labels for training, while in the latter very few labeled query-document pairs are typically available. Therefore, it is critical to study the continual learning ability of GR models before their use in real-world environments.
The continual document learning task comes with interesting challenges. In traditional pipeline frameworks for IR [16,21,41], indexing and retrieval are two separate modules. Therefore, when new documents arrive, their encoded representations can be directly included in an external index without updating the retrieval model, thanks to the decoupled architecture. In GR, all document information is encoded into the model parameters. To add new documents to the internal index (i.e., the model parameters), the GR model must be re-trained from scratch every time the underlying corpus is updated. Clearly, due to the high computational costs, this is not a feasible way of handling a dynamically evolving document collection.
A document-incremental retriever. Our aim is to develop an effective and efficient Continual-LEarner for generatiVE Retrieval (CLEVER) that is able to incrementally index new documents while supporting the ability to query both newly encountered documents and previously learned documents. To this end, we need to resolve two key challenges, one for the indexing task and one for the retrieval task.
First, how to incrementally index new documents with low computational and memory costs? We introduce incremental product quantization (IPQ), based on product quantization (PQ) methods [20], to generate PQ codes for documents as docids, which can represent large volumes of documents via a small number of quantization centroids. The key idea is to incrementally update a subset of centroids instead of all centroids, without the need to update the indices of existing data. Specifically, given the base documents (that is, the initial collection of documents), we iteratively train the document encoder and quantization centroids with a clustering loss and a contrastive loss. The clustering loss offers incentives for representations of documents around a centroid to be close, while the contrastive loss pulls a document representation close to representations of its own random spans. This helps learn discriminative document and centroid representations, so as to generalize easily to new documents. Then, as new documents arrive, we introduce two adaptive thresholds based on the distances between new and old documents in the representation space, to automatically realize three types of update for centroid representations, i.e., unchanging, changing, and addition. Finally, we index each new document by learning a mapping from document content to its docid.
Second, how to prevent catastrophic forgetting of previously indexed documents and maintain the retrieval ability? We take inspiration from the given-new strategy in cognitive science, in which humans attach new information to already known (i.e., given) similar information in their memory to enhance a mental model of the information as a whole [9,10,17]. We propose a memory-augmented learning mechanism to strengthen connections between new and old documents. We first allocate a dynamic memory bank for each session, preserving exemplar documents similar to the new ones to prevent forgetting of previously indexed documents. Then, we train a query generator model to sample pseudo-queries for documents and supplement them while continually indexing new documents, to prevent forgetting of the retrieval task.
Experimental findings. We introduce two novel benchmark datasets constructed from the existing MS MARCO [36] and Natural Questions [25] datasets, simulating the continual addition of documents to the system. Extensive evaluation shows that CLEVER performs significantly better than prevailing continual learning methods.

PROBLEM STATEMENT
Task formulation. Given a large-scale base document set D_0 and sufficiently many labeled query-document pairs P_0^{D_0}, we can train an initial GR model via a standard seq2seq objective [42]. Let the meta-parameters of the initial model be Θ_0. The continual document learning task assumes the existence of S new datasets {D_1, ..., D_s, ..., D_S}, arriving from sessions in a sequential manner. In any session s ≥ 1, D_s is composed only of newly encountered documents {d_1^s, d_2^s, ...} without queries related to these documents. Let the model parameters before the s-th update be Θ_{s-1}. For session s, the GR model is trained to update its parameters to Θ_s via the new dataset D_s and the previous datasets {D_0, ..., D_{s-1}}, and the model with Θ_s must serve queries over all datasets {D_0, ..., D_s}.
Evaluation.After updating GR models with new documents, we explore two types of test query set for performance evaluation.
Single query set. As illustrated in Figure 1(a), under this condition there is only one test query set Q^test, whose relevant documents arrive in different sessions. However, we cannot directly compare the retrieval performance before and after incremental updates: many widely-used ranking metrics [32] are based on ground-truth relevant documents, which change across sessions. Instead, we compare the overall performance VERT_s of different methods on Q^test in the same session s vertically, where d_q^+ is a relevant document to the query q ∈ Q^test in existing sessions {0, ..., s}, and M(·) denotes a widely-used evaluation metric for IR; see Section 4.3.
Sequential query set. As illustrated in Figure 1(b), under this condition the test query set Q_s^test is specific to each session s, and the relevant documents appear in existing sessions {0, ..., s}. We can directly compare different models across different sessions. Besides VERT, following [26,33], we apply (i) average performance (AP), to measure the average performance by the end of training on the entire existing data sequence, (ii) backward transfer (BWT), to measure the influence of learning a new session on the preceding sessions' performance, and (iii) forward transfer (FWT), to measure the ability to learn when presented with a new session.
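The three session-level metrics can be computed from a matrix of per-session results. Below is a minimal sketch, not the paper's implementation: it assumes `R[i][j]` holds the retrieval metric (e.g., MRR@10) on session j's test queries after training through session i, and `b[j]` is a reference score for FWT (e.g., the score of a model never exposed to earlier sessions); the function name `continual_metrics` is ours.

```python
def continual_metrics(R, b=None):
    """AP, BWT, FWT from a result matrix.

    R[i][j]: metric on session j's test queries after training session i.
    b[j]: reference performance for FWT (defaults to zero).
    """
    T = len(R)
    b = b or [0.0] * T
    ap = sum(R[T - 1]) / T                                             # average performance at the end
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)   # backward transfer
    fwt = sum(R[j - 1][j] - b[j] for j in range(1, T)) / (T - 1)       # forward transfer
    return ap, bwt, fwt
```

Negative BWT indicates forgetting of earlier sessions; positive FWT indicates that earlier training helps on sessions not yet trained on.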

METHODOLOGY
In this section, we introduce our Continual-LEarner for generatiVE Retrieval (CLEVER). Given an already constructed GR model, we first index newly arrived documents (Section 3.1), and then prevent forgetting of the retrieval ability during incremental indexing (Section 3.2). Figure 2 provides an overview of the method.

Indexing new documents
To incrementally index new documents, we need to encode new documents into docids with low computational cost, while learning associations between new documents and their docids.

3.1.1 Incremental product quantization. One popular docid representation leverages the product quantization (PQ) technique [20] to generate a PQ code as the docid. PQ can produce a large number of centroids with low storage costs, and is thus well suited to representing large collections of documents. However, it is not designed for dynamic corpora. Therefore, we propose incremental product quantization (IPQ), based on PQ, to represent docids.
The key idea is to design two adaptive thresholds to update a subset of centroids instead of all centroids, without changing the index of the updated centroids. IPQ contains two dependent steps: (i) construct the document encoder and base quantization centroids from the base documents D_0, and (ii) partially update the quantization centroids based on the relationship between new documents D_s and old documents {D_0, ..., D_{s-1}}.
Building base quantization centroids. Given the base document set D_0, we first leverage BERT [14] as the initial document encoder. Specifically, a special token w_0 = [CLS] is prepended to the i-th document d_i^0 = {w_1, ..., w_{|d_i^0|}} in D_0, and the encoder represents the document d_i^0 as a series of hidden vectors, i.e., h_0, h_1, ..., h_{|d_i^0|} = Encoder(w_0, w_1, ..., w_{|d_i^0|}). We feed the [CLS] representation h_0 into a projector network [7,8], a feed-forward neural network with a non-linear activation function (i.e., tanh), to obtain the complete document representation x_{0,i} of d_i^0. To better generalize to new documents, we propose a two-step iterative process that alternately learns the document encoder and the quantization centroids, to enhance their discriminative abilities. In Step 1, centroids are obtained via a clustering process over document representations, and in Step 2, document representations are learned from centroids with a bootstrapped training process.
Quantization. Given the sub-vector x_{0,i}^m ∈ R^{D/M}, we quantize it to the nearest centroid. We select the centroid z_{m,k}^0 that achieves the minimum quantization error, so the cluster k_m in the m-th group that x_{0,i}^m belongs to is k_m = φ_m(x_{0,i}^m) = argmin_k ||z_{m,k}^0 − x_{0,i}^m||^2. Finally, the document representation x_{0,i} is quantized as the concatenation of the M selected centroid representations, i.e., x̂_{0,i} = [z_{1,k_1}^0; ...; z_{M,k_M}^0].

Step 2: Bootstrapped training for document representations. High-quality document representations are the foundation for PQ to support effective clustering. However, with the original BERT, the discriminative ability of these representations may be limited, since the representations focus on common words and are thus not well differentiated from one another [27,49]. Therefore, we propose a bootstrapped training process based on BERT to learn discriminative document representations. The key idea is to utilize both a contrastive loss and a clustering loss to re-train the BERT encoder itself.
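The nearest-centroid assignment above can be illustrated with a short sketch: split a document vector into M sub-vectors and pick the closest centroid in each sub-codebook. The names `pq_encode` and `codebooks` are ours, and this is a simplified illustration of standard PQ encoding rather than the paper's code.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Quantize one document vector into an M-dimensional PQ code.

    x: (D,) document representation.
    codebooks: list of M arrays, each of shape (K, D/M).
    Returns [k_1, ..., k_M], the nearest-centroid index per group.
    """
    subs = np.split(x, len(codebooks))        # M sub-vectors of size D/M
    return [int(np.argmin(np.linalg.norm(cb - sub, axis=1)))
            for cb, sub in zip(codebooks, subs)]
```

The resulting code [k_1, ..., k_M] serves as the docid, and the quantized vector is the concatenation of the selected centroids.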
The contrastive loss helps keep a document representation close to representations of its own random spans while being far away from others [27]. We first sample a set of spans at four levels of granularity (word-level, phrase-level, sentence-level, and paragraph-level) for each document of length ℓ in D_0: (i) length sampling: we sample the span length for each level of granularity as ℓ_span = λ_span · (ℓ_max − ℓ_min) + ℓ_min, where ℓ_min and ℓ_max denote the minimum and maximum span length of the level and λ_span ∼ Beta(a, b), with a and b two hyperparameters; and (ii) position sampling: we randomly sample the starting position start ∼ U(1, ℓ − ℓ_span) and set the ending position end = start + ℓ_span. The final span is then span = [w_start, ..., w_{end−1}]. Given a mini-batch of N documents, we obtain N whole-document representations and their span representations, and the contrastive loss is

L_co = −(1/N) Σ_{i=1}^N (1/(4K)) Σ_{j ∈ S(i)} log [ exp(sim(x_{0,i}, s_j)/τ) / Σ_{j'} exp(sim(x_{0,i}, s_{j'})/τ) ],

where K is the number of spans sampled per granularity, S(i) is the index set of spans from d_i^0 with size 4K, the inner denominator sums over all spans in the batch, sim(·) is the dot-product function, τ is the temperature hyper-parameter, and s_j is the span representation, computed via average pooling over the output word representations given by the encoder, i.e., s_j = AvgPooling(h_start, ..., h_{end−1}).
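The beta-interpolated span sampling can be sketched as follows; the helper name `sample_span` and the 0-indexed start position are our own simplifications of the procedure above.

```python
import random

def sample_span(doc_len, l_min, l_max, a=4.0, b=2.0, rng=random):
    """Sample one span [start, end) for a document of doc_len tokens.

    The span length interpolates between l_min and l_max with a
    Beta(a, b)-distributed weight; a=4, b=2 skews toward longer spans.
    """
    lam = rng.betavariate(a, b)
    l_span = int(lam * (l_max - l_min) + l_min)
    start = rng.randint(0, doc_len - l_span)   # 0-indexed start position
    return start, start + l_span
```

Running this once per granularity level (with that level's ℓ_min and ℓ_max), K times per document, yields the 4K positive spans used in the contrastive loss.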
The clustering loss computes the mean squared error (MSE) between the document representations before and after quantization, encouraging document representations to cluster around the centroid representations:

L_cl = (1/N) Σ_{i=1}^N ||x_{0,i} − x̂_{0,i}||^2,

where x̂_{0,i} is the quantized representation of x_{0,i}. We then re-train the BERT encoder via L_co + L_cl, and again adopt the [CLS] representation given by the re-trained encoder as the document representation.
Repeating Step 1 and Step 2.
Step 1 and Step 2 are repeated iteratively for a fixed number of epochs. Finally, we obtain the initial quantization centroids used to build the PQ codes, and the Encoder(·) learned on D_0 is fixed for later sessions.
Adaptively updating quantization centroids. With the arrival of new documents D_s during session s, we first utilize the learned Encoder(·) to obtain their document representations. Based on the representations of new and old documents, a simple method would be to re-cluster them all to obtain new PQ codes as docids. However, this may incur a high computational cost, since all clustering results must be updated and the GR model must be re-trained on the updated docids.
Ideally, we want to balance the trade-off between update efficiency and quantization error. Here, we introduce a partial codebook update strategy for this purpose. Specifically, we design three types of update for the centroid representations in each sub-codebook, contributing to efficiency in memory and computational load:
• Unchanged old centroids: It is pointless to update a centroid when the features of the new documents assigned to it have a trivial contribution to its update.
• Changed old centroids: Some features of new documents may have a vital contribution to a centroid update.
• Added new centroids: We should add new centroids when new documents are significantly different from all old documents.
To achieve these updates, we first divide the representation vector x_{s,i} of each document d_i^s ∈ D_s into M sub-vectors and add each sub-vector x_{s,i}^m to the corresponding m-th group. Then, for each sub-codebook, we compute the Euclidean distance [11] between the newly arrived sub-vector x_{s,i}^m and its nearest centroid z_{m,k}^{s−1} from the last session s−1, i.e., dist(x_{s,i}^m, z_{m,k}^{s−1}). Finally, we devise two adaptive thresholds, α and β, based on this distance, to realize the three types of update.
For each cluster in a sub-codebook, α is the average distance between each assigned document sub-vector and the quantization centroid,

α = (1/|I_{m,k}^{s−1}|) Σ_{j ∈ I_{m,k}^{s−1}} dist(x_j^m, z_{m,k}^{s−1}),

where I_{m,k}^{s−1} is the set of document indices assigned to the centroid z_{m,k}^{s−1}. And β is the maximum distance between each assigned document sub-vector and the quantization centroid, relaxed by a random margin,

β = max_{j ∈ I_{m,k}^{s−1}} dist(x_j^m, z_{m,k}^{s−1}) + rand_dist,

where rand_dist ∼ U(0, σ) is sampled from the continuous uniform distribution. Note that the condition α ≤ β always holds. Therefore, as depicted in Figure 2(a), we can automatically decide the update type of each centroid representation as follows: (i) if dist(x_{s,i}^m, z_{m,k}^{s−1}) ≤ α, the centroid representation is left unchanged; (ii) if α < dist(x_{s,i}^m, z_{m,k}^{s−1}) ≤ β, we update the centroid representation: we first update the assignment set via I_{m,k}^s = I_{m,k}^{s−1} ∪ {ind}, where ind is the index number of x_{s,i}^m, and then re-compute the centroid from the sub-vectors in I_{m,k}^s; and (iii) if dist(x_{s,i}^m, z_{m,k}^{s−1}) > β, we add a new cluster, so that there are K+1 clusters in the group, and directly use the document sub-vector as the new centroid representation, i.e., z_{m,K+1}^s = x_{s,i}^m. After applying this update strategy to all M sub-codebooks, we obtain the codebook C^s at session s: (i) for new documents in D_s, we obtain their PQ codes U^{D_s} based on C^s as their docids, and (ii) for old documents, the PQ codes are not affected, since we only operate on the centroid representations, not on the indices of the updated centroids. In the special case of old documents around a centroid sharing the same representation, i.e., α = β = 0, we directly change the centroid representation based on the new document sub-vector.

3.1.2 Indexing objective. To memorize information about each new document, we leverage maximum likelihood estimation (MLE) [35] to maximize the likelihood of a docid conditioned on the corresponding document, i.e.,

L_index^{s,+} = Σ_{j=1}^{|D_s|} log P(u_j | d_j^s; Θ_s),

where u_j ∈ U^{D_s} is the docid of d_j^s and Θ_s is the GR model parameters at session s.
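The three-way update rule for one sub-codebook can be sketched as follows. This is a simplified illustration under our own data layout (a centroid matrix plus per-centroid member lists), with a mean re-computation for the "changed" case; it is not the paper's implementation.

```python
import numpy as np

def update_subcodebook(centroids, members, new_sub, sigma=0.2, rng=np.random):
    """Apply the unchanged / changed / added rule for one new sub-vector.

    centroids: (K, d) float array of centroid representations.
    members: list of K lists holding the sub-vectors assigned to each centroid.
    Returns the (possibly grown) centroids, members, and the assigned index.
    """
    dists = np.linalg.norm(centroids - new_sub, axis=1)
    k = int(np.argmin(dists))                          # nearest old centroid
    d_old = np.linalg.norm(np.asarray(members[k]) - centroids[k], axis=1)
    alpha = d_old.mean()                               # average member distance
    beta = d_old.max() + rng.uniform(0, sigma)         # relaxed maximum distance
    if dists[k] <= alpha:                              # unchanged old centroid
        return centroids, members, k
    if dists[k] <= beta:                               # changed old centroid
        members[k].append(new_sub)
        centroids[k] = np.mean(members[k], axis=0)
        return centroids, members, k
    centroids = np.vstack([centroids, new_sub])        # added new centroid
    members.append([new_sub])
    return centroids, members, len(centroids) - 1
```

Because a new centroid is appended at index K+1 rather than re-ordering the codebook, the PQ codes of previously indexed documents remain valid.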

Preserving retrieval ability
During continual indexing of new documents, it is important for the GR model not to forget its retrieval ability. We are inspired by the fact that humans benefit from previous similar experiences when taking actions [9,10,17] and propose a memory-augmented learning mechanism to build meaningful connections between new and old documents. Specifically, we first construct a memory bank with similar documents for each new session and replay the indexing of these documents alongside the indexing of the new documents. Then, we leverage a query generator model to sample pseudo-queries for documents, and the resulting query-docid pairs are employed to maintain the retrieval ability. The overall learning process is visualized in Figure 2(b).
Dynamic memory bank construction. The memory bank stores a small subset of old documents that are similar to new documents in the PQ space. We assume that two documents are similar if many dimensions of their PQ codes are the same. For each document in D_s, we aim to retrieve its similar documents at different levels. Concretely, we iteratively change its PQ code at different dimensions via the following steps: (i) we first set the number n of PQ code dimensions to be changed to 1; (ii) we randomly select n dimensions of the PQ code and assign different centroids to the selected dimensions to obtain a similar PQ code; we repeat this process T times; and (iii) we collect the documents from previous sessions that are associated with the obtained PQ codes. Steps (ii) and (iii) are repeated, incrementing n by 1, up to at most M/6.
Finally, we group the similar documents of each document in D_s to construct a session-specific memory bank B_s. Note that the memory bank is dynamically updated at each new session.
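A minimal sketch of the similarity lookup via PQ-code perturbation, assuming `pq_index` maps PQ-code tuples to the ids of previously indexed documents; the helper name and index layout are our own.

```python
import random

def similar_old_docs(code, K, pq_index, n_max, T, rng=random):
    """Collect old documents whose PQ codes differ from `code`
    in n = 1 .. n_max randomly chosen dimensions (T trials per n).

    code: list of M centroid indices; K: clusters per group;
    pq_index: dict mapping code tuples to lists of old doc ids.
    """
    bank = set()
    M = len(code)
    for n in range(1, n_max + 1):
        for _ in range(T):
            cand = list(code)
            for m in rng.sample(range(M), n):
                # assign a different centroid index on this dimension
                cand[m] = rng.choice([k for k in range(K) if k != code[m]])
            bank.update(pq_index.get(tuple(cand), []))
    return bank
```

With M = 24 and n_max = M/6 = 4, this retrieves old neighbours at four levels of similarity, from near-identical codes (n = 1) to looser matches (n = 4).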
Rehearsing the indexing of old documents. For each new session s, we aim to prevent forgetting of previously indexed documents. Given the meta model parameters Θ_{s−1} before the s-th update, we apply MLE over the memory bank B_s to update the GR model, i.e.,

L_index^{s,−} = Σ_{j=1}^{|B_s|} log P(u_j | d_j; Θ_s),

where u_j ∈ U^{B_s}, the PQ codes of B_s.

Constructing pseudo query-docid pairs. To prevent forgetting of the retrieval ability while indexing new documents, we train a query generator model to sample pseudo-queries for documents and supplement the query-docid pairs during indexing. We fine-tune the T5 model [38] on the query-document pairs P_0^{D_0} of the initial session, taking the document terms as input and producing a query, following [37]. After fine-tuning, the parameters Θ_q of the query generator are fixed.
For each new session s, we generate pseudo-queries for each document in D_s and B_s via the query generator, and denote the obtained pairs of pseudo-queries and documents as P^{D_s} and P^{B_s}, respectively.

Overall training objective
In the training phase, we sequentially train the GR model on each session s by combining the objectives for indexing and retrieval, i.e.,

min_{Θ_s} L_index^{s,+} + L_index^{s,−} + L_retrieval^s + γ L_EWC^s,

where γ is a hyper-parameter and L_retrieval^s is the seq2seq retrieval loss over the pseudo query-docid pairs. The elastic weight consolidation (EWC) [23] loss L_EWC^s regularizes the model parameters via a weighted distance between Θ_{s−1} and Θ_s,

L_EWC^s = Σ_i F_i (θ_i^s − θ_i^{s−1})^2,

where F is the Fisher information matrix [23] and θ_i denotes each model parameter.
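Numerically, the combined objective can be sketched as below. The loss values and per-parameter Fisher estimates are placeholders, and the function names are ours; in practice each term would be a differentiable loss inside the training loop.

```python
def ewc_penalty(theta, theta_prev, fisher):
    """EWC regularizer: sum_i F_i * (theta_i - theta_prev_i)^2."""
    return sum(f * (t - tp) ** 2 for f, t, tp in zip(fisher, theta, theta_prev))

def session_loss(l_index_new, l_index_mem, l_retrieval,
                 theta, theta_prev, fisher, gamma=0.5):
    """Indexing losses (new docs + rehearsed memory bank), retrieval loss,
    and the Fisher-weighted penalty that anchors parameters near the
    previous session's solution."""
    return (l_index_new + l_index_mem + l_retrieval
            + gamma * ewc_penalty(theta, theta_prev, fisher))
```

Parameters with large Fisher values (important for earlier sessions) are penalized more strongly for drifting, which is how EWC trades plasticity for stability.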

EXPERIMENTAL SETTINGS
Next, we summarize our experimental settings.The code can be found at https://github.com/ict-bigdatalab/CLEVER.

Benchmark construction
To facilitate the study of continual document learning for GR, we build two benchmark datasets, CDI-MS and CDI-NQ, from MS MARCO Document Ranking [36] and Natural Questions (NQ) [25], respectively. MS MARCO contains 367,013 query-document training pairs, 3,213,835 documents, and 5,192 queries in the dev set. NQ contains 307k query-document training pairs, 231k documents, and 7.8k queries in the dev set. We report performance on the dev sets, as both the MS MARCO Document Ranking and NQ leaderboards limit the frequency of submissions [33,46].
To mimic the arrival of new documents in MS MARCO and NQ, we first randomly sample 60% of the documents from the whole document set as the base documents D_0, and leverage their corresponding relevance labels to construct the query-document pairs P_0^{D_0}. Then, we randomly sample 10% of the documents from the remaining document set as a new document set, and repeat this operation four times to obtain D_1-D_4. The test query sets are defined as follows: (i) for the single query set, all dev queries are denoted as Q^test, and (ii) for the sequential query set, we sample 60%, 10%, 10%, 10%, and 10% of the queries from the whole dev query set as Q_0^test, ..., Q_4^test, respectively. Furthermore, we compare our model with an adaptation of Ultron as the BASE method, wherein the PQ technique is used to generate docids and the GR model is continually fine-tuned by directly mapping each new document to its docid. We also compare with DSI++ [33], which continually fine-tunes DSI over new documents by directly assigning each new document an atomic docid, i.e., an arbitrary unique integer. We re-implement it, since the source code has not been released.
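The 60/10/10/10/10 document split can be reproduced with a sketch like the following; the function name and the seed are ours, not fixed by the paper.

```python
import random

def make_sessions(doc_ids, fractions=(0.6, 0.1, 0.1, 0.1, 0.1), seed=42):
    """Shuffle the corpus and split it into the base set D0 and new sets D1-D4."""
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    sessions, start = [], 0
    for f in fractions:
        size = round(f * len(ids))
        sessions.append(ids[start:start + size])
        start += size
    sessions[-1].extend(ids[start:])   # rounding leftovers go to the last session
    return sessions
```

The same routine applied to the dev queries yields the sequential query sets Q_0^test, ..., Q_4^test.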

Model variants.
To verify the effectiveness of IPQ, we implement variants that keep the memory-augmented learning mechanism. To build base quantization centroids, we have: (i) CLEVER_atomic, which uses arbitrary unique integers as docids, as used in DSI++; (ii) CLEVER_BERT, which directly uses the original BERT_base [14] to obtain document representations and builds PQ codes as docids with the original PQ technique [20]; the codebook is fixed in all sessions; and (iii) CLEVER_BERT+, which extends CLEVER_BERT by re-clustering the document representations obtained by BERT_base as new documents arrive; the codebook is updated at each new session.
To adaptively update the quantization centroids, the variants are: (i) CLEVER+ leverages the two-step iterative process to build discriminative base PQ codes; the codebook is then fixed for new sessions, i.e., only the "unchanged old centroids" type is adopted; (ii) CLEVER++_α extends CLEVER+ by adding the α threshold, i.e., adopting "unchanged old centroids" and "changed old centroids;" (iii) CLEVER++_β extends CLEVER+ by adding the β threshold, i.e., adopting "added new centroids" and "changed old centroids" to update the quantization centroids; and (iv) CLEVER++_re extends CLEVER+ by re-building discriminative PQ codes for all documents as new documents arrive; the codebook is updated at each new session.
To verify the effectiveness of the memory-augmented learning mechanism, the variants (all using IPQ) are: (i) CLEVER_−EWC, which removes the EWC loss L_EWC^s in Eq. 7 when re-training the GR model; (ii) CLEVER_−rehearsal, which removes the rehearsal indexing loss L_index^{s,−} in Eq. 7; (iii) CLEVER_−query, which removes the pseudo-query retrieval loss L_retrieval^s in Eq. 7; and (iv) CLEVER_random, which randomly selects old documents to construct a memory bank with the same number of documents as the similarity-based bank in CLEVER (an adaptation of DSI++ [33]), and then re-trains the GR model via Eq. 7.

Evaluation metrics
The evaluation metric M(·) for IR in Section 2 is usually instantiated as mean reciprocal rank (MRR@k), recall (R@k), hit ratio (Hits@k), or top-k retrieval accuracy (ACC@k). Following [33,46,47,54], we report the continual results in terms of MRR@10 for CDI-MS and Hits@10 for CDI-NQ. Further analysis shows that the relative order of the different models on other IR metrics is quite consistent with that on MRR@10 and Hits@10.
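For reference, the two reported metrics over a single ranked docid list can be computed as below; both are then averaged over all test queries. This is a standard formulation, not code from the paper.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant docid in the top k, else 0."""
    for rank, docid in enumerate(ranked[:k], start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

def hits_at_k(ranked, relevant, k=10):
    """1.0 if any relevant docid occurs in the top k, else 0.0."""
    return float(any(d in relevant for d in ranked[:k]))
```

In the GR setting, the ranked list comes from constrained beam search over valid docids.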

Implementation details
For IPQ, the length M of the PQ codes is 24, the number of clusters K per group is 256, and the dimension D of the vectors is 768. For the contrastive loss in the document encoder, ℓ_min and ℓ_max for sampling phrase-level spans are 4 and 16, respectively; for sentence-level spans, 16 and 64; and for paragraph-level spans, 64 and 128. The parameters a and b of the beta distribution are 4 and 2, respectively, which skews sampling towards longer spans. The number of spans sampled per granularity is 5. For the memory-augmented learning mechanism, the repeat time T is 10, the sampling probability is 0.2, the scale σ of the uniform distribution is 0.2, and the EWC weight γ is 0.5.
To train the document encoder in IPQ, we initialize the document encoder from the official BERT checkpoint. We use a learning rate of 5e-5 and the Adam optimizer [22] with a linear warmup over the first 10% of steps. Long input documents are truncated into several chunks with a maximum length of 512. The temperature hyper-parameter τ is 0.1. We train for 6 epochs on four NVIDIA Tesla A100 40GB GPUs.
The GR baselines and all variants of CLEVER are based on a transformer encoder-decoder architecture with hidden size 768, feed-forward layer size 3072, 12 transformer layers, and 12 self-attention heads, for both the encoder and the decoder. We implement the generative model in PyTorch with Huggingface's Transformers library. We initialize the parameters of the encoder-decoder architecture from the official T5 checkpoint [38]. We use a learning rate of 3e-5 and the Adam optimizer with warmup: the learning rate increases over the first 10% of batches and then decays linearly to zero. The maximum input length is 512, label smoothing is 0.1, weight decay is 0.01, and the gradient norm is clipped to 0.1. We train in batches of 8192 tokens on four NVIDIA Tesla A100 40GB GPUs. At inference time, we adopt constrained beam search [12] to decode the docids with 24 timesteps and 15 beams. To train the query generator, we also initialize from the official T5 checkpoint [38], with a learning rate of 5e-4. For each new document, we adopt beam search to decode pseudo-queries with at most 32 timesteps and 10 beams.

EXPERIMENTAL RESULTS
In this section, we (i) analyze the retrieval performance on the CDI-MS and CDI-NQ datasets under both incremental and non-incremental settings, (ii) assess catastrophic forgetting and forward transfer abilities, and (iii) analyze the memory and computation costs. For (ii) and (iii), we conduct experiments on the CDI-MS dataset under the sequential query set setting in terms of VERT (%).

Baseline comparison
Incremental performance on a single query set. The performance comparison between CLEVER and the baselines on the single query set is shown in Table 1. BM25 performs better than DPR; the underlying reason may be that BM25 is a data-independent probabilistic model, which makes it adaptable in the face of dynamic corpora. The BASE method suffers a significant drop as new documents are added. By assigning new documents atomic docids and using sampled old documents, DSI++ shows slight improvements over BASE. These results show that continual document learning for GR is a non-trivial challenge.
When we look at variants of CLEVER in terms of IPQ, we find that: (i) CLEVER_BERT performs worse than CLEVER_atomic, which updates the embedding of each individual docid (as also used in DSI++), showing that it is difficult for GR models to accommodate new documents without updating docids. However, as shown in Section 5.4, the memory of CLEVER_atomic continues to grow with the number of documents, and the time consumed by each update step increases with the number of steps. (ii) CLEVER_BERT+ and CLEVER++_re give the worst performance. The reason might be that re-clustering old and new documents changes the previously learned centroids, and thus the docids of old documents, which invalidates the learned document-docid mapping. (iii) The improvements of CLEVER+ over CLEVER_BERT demonstrate the need to learn discriminative document representations and quantization centroids. (iv) CLEVER++_α and CLEVER++_β outperform CLEVER+, showing that updating old centroids and introducing new centroids facilitate the assimilation of new documents.
When we look at variants of CLEVER with different learning mechanisms, we observe that: (i) CLEVER_−query performs worse than CLEVER_−rehearsal and CLEVER_−EWC, showing that constructing pairs of pseudo-queries and docids and supplementing them during continual indexing helps prevent forgetting of the retrieval ability. (ii) CLEVER_random shows the worst performance: randomly selected old documents do not provide meaningful connections between old and new documents. Finally, CLEVER achieves the best performance. The results imply that applying an adaptive update strategy for PQ codes can assign effective docids to new documents without changing the old docids. Rehearsing similar old documents and generating pseudo-queries lets the model absorb knowledge from new documents while preserving the previously learned retrieval ability.

Table 1: Model performance under the single query set setting. We evaluate the performance on Q^test in each session D_0-D_4 in terms of VERT (%). Columns report CDI-MS (MRR@10) and CDI-NQ (Hits@10). * indicates statistically significant improvements over all baselines (p-value < 0.05).
Incremental performance on a sequential query set. The performance comparison between CLEVER and the baselines on the sequential query set is shown in Table 2 (recall that the metrics were introduced in Section 2). The relative order of the different models under this setting in terms of VERT is quite consistent with that under the single query set setting. For the metrics that measure performance across sessions, i.e., AP, BWT, and FWT, the full version of CLEVER achieves the best performance, again demonstrating the effectiveness of the proposed IPQ and learning mechanisms.

Non-incremental performance. To assess the performance of CLEVER before turning to the incremental setting, we evaluate the performance on Q^test over D_0 under the single query set setting. As shown in Table 3: (i) Compared with traditional IR models, CLEVER and existing generative retrieval methods achieve better performance, indicating the effectiveness of integrating the different components into a single consolidated model. (ii) CLEVER achieves better results than existing generative retrieval models, demonstrating that the two-step iterative process for learning discriminative PQ codes as docids contributes to retrieval effectiveness.

Assessing catastrophic forgetting
To assess catastrophic forgetting in the proposed methods, we show how the performance on the base session D_0 and the first incremental session D_1 varies while training on the remaining sessions. For the indexing task, we evaluate the overall indexing accuracy on D_0 and D_1: we take a document as input to the GR model and treat it as a positive sample if the generated sequence exactly matches the correct docid; otherwise, the document is a negative sample. For the retrieval task, we evaluate the performance on Q_0^test and Q_1^test in terms of MRR@10. See Figure 3. We observe that: (i) CLEVER_random, CLEVER_−rehearsal, and CLEVER_−query suffer from catastrophic forgetting; applying IPQ and the memory-augmented learning mechanism separately does not suffice for the model to perform well during continual document learning. (ii) DSI++ underperforms CLEVER by a large margin. A possible reason is that the atomic integers used as docids in DSI++ adapt slowly to new documents and strongly affect the docids of old documents, which may cause loss of previously learned knowledge. (iii) Compared to CLEVER, catastrophic forgetting is mitigated less well in CLEVER_−rehearsal and CLEVER_−query, which underlines the importance of rehearsing old documents and the generated pseudo-queries. And (iv) CLEVER almost completely avoids catastrophic forgetting on both the indexing and retrieval tasks, showing its effectiveness in a dynamic setting.

Assessing forward transfer
Positive forward knowledge transfer is an essential ability during continual document learning, which includes both indexing and retrieval. In this section we therefore explore the forward transfer ability of CLEVER, i.e., transferring knowledge from old documents to new documents. We write CLEVER_init for CLEVER without initialization from the previous sessions, and INDIVIDUAL for fine-tuning the GR model on each session individually. Table 4 displays the results. We see that: (i) CLEVER consistently and significantly outperforms INDIVIDUAL in the last four sessions. The underlying reason may be that CLEVER transfers old knowledge to new sessions when continuously indexing new documents, whereas INDIVIDUAL learns the indexing and retrieval tasks of each new session independently, on a small amount of data. (ii) The improvements of CLEVER over CLEVER_init further demonstrate the need for prior initialization in CLEVER, i.e., initializing the model parameters for a new session from the parameters of the previous one.

Effectiveness-efficiency trade-off
We evaluate the effectiveness and efficiency of different models on Q^test_0 in terms of VERT (%) after training on all sessions. For training time, we compare the overall training time by the end of the sequential training. For memory footprint, we compute the disk space occupied by the model at the end of training. Figure 4 shows the relative memory and training time, i.e., the memory ratio and the time ratio of these methods with respect to BASE.
Figure 4 (a) shows the effectiveness-memory trade-off. We find that: (i) Traditional IR models (BM25 and DPR) consume much larger memory footprints than generative IR models, without discernible advantages in retrieval performance. This result indicates the significant memory consumption caused by re-computing representations and re-indexing new documents. (ii) Although CLEVER_atomic and DSI++ are much more effective than the BASE model and CLEVER+, they suffer from severe memory inefficiency, since they require the large softmax output space that comes with atomic docids, and the embedding of each individual docid must be added to the model as new documents arrive. (iii) CLEVER performs best in effectiveness and is almost as efficient as the BASE model: it only occupies a small amount of additional memory compared to BASE, and this overhead does not grow over sessions.
Figure 4 (b) shows the effectiveness-training time trade-off. We observe that: (i) BM25 trains quickly, but its performance is suboptimal due to its vulnerability to the vocabulary mismatch issue and its inability to adequately capture semantic information. (ii) The BASE model achieves training acceleration at the cost of compromised performance, which suggests that maintaining effectiveness is a non-trivial challenge for GR in the dynamic corpora setting. (iii) CLEVER_atomic and DSI++ sacrifice training time for effectiveness, since their randomly initialized embeddings need to be trained from scratch. (iv) CLEVER+ re-trains all centroids in every session, incurring computational overhead without improving performance much. (v) CLEVER performs best in effectiveness and requires a training time similar to that of the BASE model. These results demonstrate that CLEVER can be deployed in practical environments thanks to its high effectiveness and efficiency.

RELATED WORK
Generative retrieval. Recently, generative retrieval (GR) has been proposed as an alternative paradigm for IR [34]. Unlike the traditional "index-then-rank" paradigm [14-16, 21, 28, 29, 31, 41, 51], a single seq2seq model is used to directly map a query to a relevant docid. In this way, the large external index is turned into an internal index (i.e., the model parameters), simplifying the data structures needed to support the retrieval process; it also enables end-to-end optimization towards the global objective of IR tasks. GR involves two key issues: (i) the indexing task, i.e., how to associate the content of each document with its docid, and (ii) the retrieval task, i.e., how to map queries to their relevant docids. For the indexing task, previous efforts boil down to two research lines. One is to generate docids for documents [53], including atomic identifiers (e.g., a unique integer identifier [46]), simple string identifiers (e.g., titles [5, 12, 44], n-grams [2, 4] or URLs [54]), and semantically structured identifiers (e.g., clustering-based prefix representations [46] or PQ codes [54]). The other is to establish a semantic mapping from documents to docids; various kinds of document content have been proposed to enhance this association [6, 46, 54], e.g., contexts at different semantic granularities [6, 55] and hyperlink information [6]. For the retrieval task, most approaches directly learn to map queries to relevant docids in an autoregressive way. Recently, some work has generated pseudo-queries [47, 54] and designed pre-training tasks [6] to tackle the limited availability of labeled data. However, current GR methods mainly focus on the stationary corpus scenario, i.e., a fixed document collection.
Very recently, Mehta et al. [33] have shown that continually memorizing new documents leads to considerable forgetting of old documents. They directly assign each new document an arbitrary unique integer identifier and randomly sample some old documents via experience replay [3] for incremental updates. However, learning embeddings for each individual new docid from scratch incurs prohibitively high computational costs, and the relationships between new and old documents cannot easily be captured by randomly selected exemplars. Moreover, they only consider the sequential query set setting for performance evaluation.
In this work, the proposed IPQ technique is able to effectively represent new documents by updating a subset of centroids instead of all centroids, eliminating the need to update existing data indices. In IPQ, the partial codebook update strategy can be applied to other clustering-based docids, e.g., the clustering-based prefix representation in DSI [46], which we leave as future work.
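The idea of updating only part of a PQ codebook can be illustrated with a simplified sketch for a single subspace. The fixed scalar threshold `tau` below is a stand-in for CLEVER's two adaptive thresholds, and a real system would run k-means over the uncovered vectors rather than adding a single mean centroid; the key property shown is that old centroids stay frozen, so existing PQ codes (docids) remain valid.

```python
import numpy as np

def assign_codes(codebook, vecs):
    """Assign each sub-vector to its nearest centroid (its PQ code)."""
    d = np.linalg.norm(vecs[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def partial_codebook_update(codebook, new_vecs, tau):
    """Simplified partial PQ-codebook update for one subspace.

    codebook : (K, d) existing centroids, kept frozen so that old
               documents' codes do not change.
    new_vecs : (N, d) sub-vectors of newly arrived documents.
    tau      : distance threshold; sub-vectors far from every old
               centroid spawn a fresh centroid instead of distorting
               an old one.
    """
    dists = np.linalg.norm(new_vecs[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = dists.min(axis=1)
    far = new_vecs[nearest > tau]  # new content poorly covered by old centroids
    if len(far):
        # add the mean of the uncovered vectors as one new centroid
        codebook = np.vstack([codebook, far.mean(axis=0, keepdims=True)])
    codes = assign_codes(codebook, new_vecs)
    return codebook, codes
```

Only new documents whose sub-vectors fall outside the coverage of the old codebook trigger new centroids, which is what keeps the indexing cost of new sessions low.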
Continual learning. Continual learning (CL) has been a long-standing research topic that aims to overcome catastrophic forgetting of previously acquired knowledge while continuously learning new knowledge from few labeled samples [43]. Recently, CL has been considered in computer vision [40, 45] and natural language processing [1, 19], but few efforts have been devoted to IR so far. CL scenarios [30, 39] can be divided into task-incremental, domain-incremental and class-incremental learning. In this work, we consider the practical setting of dynamic corpora with newly added documents.
Existing CL approaches [13] can be categorized into: (i) replay methods, which maintain a subset of previous samples and train the model on them together with samples from the new session [18, 52]; (ii) regularization-based methods, which regularize the model parameters so that parameters important for previous tasks are protected when training on each new task [23, 50]; and (iii) parameter-isolation methods, which dynamically allocate a set of parameters for each task [1]. Here, we take advantage of replay and regularization-based methods to memorize new documents.
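The two ingredients we combine can be sketched in a single objective. This is an illustrative toy, not CLEVER's actual loss: the replay term averages the task loss over new samples and a rehearsal buffer, and the penalty is an EWC-style quadratic term [23] that discourages drift of parameters deemed important for earlier sessions; all names and the 0.5 mixing weight are illustrative choices.

```python
import numpy as np

def continual_loss(loss_new, loss_replay, params, old_params, importance, lam):
    """Replay + regularization, combined into one training objective.

    loss_new    : task loss on samples from the new session
    loss_replay : task loss on rehearsed samples from old sessions
    importance  : per-parameter importance weights from earlier sessions
    lam         : strength of the quadratic drift penalty
    """
    task = 0.5 * (loss_new + loss_replay)
    penalty = lam * np.sum(importance * (params - old_params) ** 2)
    return task + penalty
```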

CONCLUSION
In this work, we have focused on a critical requirement for generative retrieval (GR) models to be usable in practical scenarios, where new documents are continuously added to the corpus. In particular, we have presented a continual learning method that alleviates the potentially high computational cost of generating new docids and leverages both similar past documents and pseudo-queries to consolidate knowledge. Extensive experiments have demonstrated the effectiveness and efficiency of our method.
Despite the promising results that GR has shown, its scalability remains challenging, particularly concerning document addition, removal, and updates; these factors significantly affect the widespread adoption of GR in various applications. For the proposed CLEVER method, exploring the joint optimization of the quantization method in IPQ and the GR model using supervised labels, and devising more advanced thresholds for adaptively updating PQ codes, hold great potential for enhancing retrieval effectiveness.

Figure 1: Evaluation criteria.

Figure 2: (a) Encoding new documents into docids by updating a subset of quantization centroids. (b) Overall training objective for continual indexing while alleviating forgetting of the retrieval ability.
… and P^B_s, respectively. Given the meta model parameters Θ_{s-1} of the GR model, we also apply MLE to maximize the likelihood of a relevant docid conditioned on each pseudo-query in P^D_s and P^B_s, denoted as

$$\mathcal{L}_q(\Theta_s \mid \Theta_{s-1}) = \sum_{j} \log P\big(u_{f(j)} \mid q_j; \Theta_{s-1}\big), \tag{6}$$

where q_j ∈ {P^D_s, P^B_s}, f(j) is the index of the document relevant to q_j, j ∈ {1, …, |{P^D_s, P^B_s}|}, and u_{f(j)} ∈ {U^D_s, U^B_s} is the corresponding relevant docid.

4.2.1 Baselines. Traditional IR models. (i) BM25 [41] is an effective sparse retrieval model; we re-index all previously seen documents upon the arrival of new documents. (ii) DPR [21] is a representative dense retrieval model with a BERT-based dual-encoder architecture. We use the model trained on the first session D_0 to encode newly arrived documents and then add their encoded representations to the existing external index. Generative retrieval models. (i) DSI [46] encodes all information about the corpus in the model parameters; we adopt atomic docids for DSI. (ii) DSI-QG [56] leverages a query generation model to augment the document collection at indexing time. (iii) NCI [47] utilizes a prefix-aware weight-adaptive decoder; we adopt NCI with DocT5Query-augmented queries. (iv) Ultron [54] applies a three-stage training pipeline; we adopt Ultron with PQ as the docid. Due to their limitations in accommodating dynamic corpora, we only evaluate these models in non-incremental scenarios.

Figure 3: The catastrophic forgetting phenomenon of GR models. On the CDI-MS dataset, we illustrate the indexing accuracy on D_0 and D_1, and the retrieval MRR@10 on Q^test_0 and Q^test_1 under the sequential query set setting.

Figure 4: Comparison of (a) the effectiveness-memory trade-off and (b) the effectiveness-training time trade-off. Up and left is better. The search is performed on GPU.

Table 2: Model performance under the sequential query set setting. We evaluate the performance on Q^test_0, …, Q^test_4.

Table 3: Model performance in non-incremental scenarios. We train on D_0 and evaluate the performance on Q^test. Note that DSI++ trained on D_0 is DSI. * indicates statistically significant improvements over all baselines (p-value < 0.05).

Table 4: Forward transfer analysis on CDI-MS under the sequential query set setting. We evaluate the performance on Q^test_0 – Q^test