Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies

Recently, a new paradigm called Differentiable Search Index (DSI) has been proposed for document retrieval, wherein a sequence-to-sequence model is learned to directly map queries to relevant document identifiers. The key idea behind DSI is to fully parameterize traditional ``index-retrieve'' pipelines within a single neural model, by encoding all documents in the corpus into the model parameters. In essence, DSI needs to resolve two major questions: (1) how to assign an identifier to each document, and (2) how to learn the associations between a document and its identifier. In this work, we propose a Semantic-Enhanced DSI model (SE-DSI) motivated by Learning Strategies in the area of Cognitive Psychology. Our approach advances original DSI in two ways: (1) For the document identifier, we take inspiration from Elaboration Strategies in human learning. Specifically, we assign each document an Elaborative Description based on the query generation technique, which is more meaningful than a string of integers in the original DSI; and (2) For the associations between a document and its identifier, we take inspiration from Rehearsal Strategies in human learning. Specifically, we select fine-grained semantic features from a document as Rehearsal Contents to improve document memorization. Both the offline and online experiments show improved retrieval performance over prevailing baselines.


INTRODUCTION
Document retrieval is a fundamental task in many real-world applications, such as Web search and question answering systems [32,43,48]. It aims to identify a list of candidates from a large document repository given a user query. These candidates are then re-ranked to create a final list of results by computing a more precise ranking score for each document. The performance of the initial retrieval stage is crucial to the overall quality of a search system. Traditional algorithms such as BM25 [47] usually rely on exact term matching signals through an inverted index. However, this approach can suffer from the vocabulary mismatch problem [23,61] due to its term independence assumption.
Research has recently shifted toward dense retrieval, driven by advances in deep learning, especially representation learning techniques [24]. These methods convert the semantic information in both queries and documents into dense vectors, and then use approximate nearest neighbor search algorithms [8] to perform efficient vector search [28]. Although dense retrieval has been shown to be effective in practical applications, the "index-retrieval" pipeline makes it difficult to jointly optimize all heterogeneous modules in an end-to-end way. Besides, an explicit large index is needed to search over the whole corpus, leading to significant memory consumption and computational overhead.
Figure 1: (a) Elaboration Strategies: Given a document, a semantically meaningful name, e.g., the document title, helps people encode and recall it better than a name with weak semantics, e.g., a string of integers. (b) Rehearsal Strategies: By selectively underlining or highlighting details in the document (e.g., key passages and sentences), people are more likely to move information from short-term memory to long-term memory than by simply reading the document without underlining.
Recently, Tay et al. [52] proposed an alternative paradigm, called Differentiable Search Index (DSI). The key idea is to fully parameterize the different components of indexing and retrieval with a single consolidated model, in which all information about the corpus is encoded in the model parameters. In essence, DSI adopts a generative scheme to directly predict the relevant document identifiers (docids) for a given query. DSI achieves this by jointly optimizing two basic tasks: (i) the indexing task, which learns a mapping from the document content to its identifier (docid), so that the index is stored in the model parameters and indexing is simply another kind of model training; and (ii) the retrieval task, which maps queries to relevant docids. In this way, such a consolidated model can be optimized directly in an end-to-end manner towards a global objective. Moreover, DSI does not need to manage a complicated explicit index structure, which largely reduces the memory and computational cost.
As envisioned in the recent proposal paper [37] and the original DSI [52], DSI needs to answer two major questions: (1) how to assign an identifier to each document, and (2) how to learn the associations between a document and its identifier. To answer the first question, [52] used as the docid either a single token (an arbitrary unique integer) or a string of tokens, which can be an arbitrary numeric string or a semantic numeric string obtained via hierarchical clustering. To bind a document to its docid, it used a straightforward seq2seq approach that takes the original documents as inputs and generates docids as outputs. Despite the superiority of the original DSI model over BM25 [47] on the NQ 100K dataset [29], some follow-up studies [53,63] and our work have shown that it still performs worse than state-of-the-art methods by a large margin. This observation indicates that how to design a generative model for retrieval is still an open challenge.
When we look at the process of corpus encoding in DSI, we find that it works much like a human who uses interconnected "neurons" to learn to identify patterns in data and then directly make predictions about what should come next. Therefore, in this work, we design DSI models inspired by Learning Strategies [57] in Cognitive Psychology [46,49]. As defined in [57], Learning Strategies are behaviors and thoughts in which a learner engages and which are intended to influence the learner's encoding process [39]. In a similar manner, we propose a novel Semantic-Enhanced DSI model, SE-DSI for short, to further optimize the solutions to the above two questions. Our approach advances the original DSI in two ways. For the docids, we draw inspiration from Elaboration Strategies in human learning [7,10,13,26,51]. As shown in Figure 1(a), naming a document with natural language that has semantic relationships with it contributes to better encoding and recall for humans than an integer-based string. Therefore, we construct an Elaborative Description (ED) from each document as its docid, to identify it with explicit semantic meaning. Specifically, we leverage the query generation technique to generate a pseudo query from the corresponding document as its ED.
For associations between documents and their docids, we draw inspiration from Rehearsal Strategies in human learning [31,54,55,56,57]. As shown in Figure 1(b), those who underline important contents in a document are able to recall substantially more information, and retain it better in long-term memory, than those who simply read the document without underlining. Therefore, we tailor two augmentation methods to generate Rehearsal Contents (RCs) at different semantic granularities. The original document with coarse-grained semantic features and the RCs with fine-grained semantic features can then be paired with the corresponding ED as training instances for better memorizing the documents.
Offline experiments on two representative document retrieval datasets, i.e., MS MARCO and NQ, show that SE-DSI performs significantly better than strong baseline solutions. We also simulate the zero-resource setting and show that SE-DSI works well even when only the document information is available. We further conduct an online evaluation on Baidu search through an A/B test. The results show that SE-DSI achieves significant improvements over existing methods in Baidu on the official site retrieval task.

PRELIMINARIES
For a better description of our model, we first briefly describe the basic idea of the original DSI model [52], unifying two basic modes of operation, i.e., indexing and retrieval in an end-to-end way.
Indexing: To memorize information about each document, Tay et al. [52] directly take each original document $d_i$ as input and generate its docid $id_i$ as output in a straightforward Seq2Seq fashion. The model is trained with the standard T5 [45] training objective with the teacher forcing policy, i.e.,

$$\mathcal{L}_{Indexing}(\theta) = \sum_{d_i \in \mathcal{D}} \log P(id_i \mid d_i; \theta),$$

where $\mathcal{D}$ is a given corpus, $\theta$ denotes the model parameters, and the docid $id_i$ can be represented in three ways:
(1) Atomic docid, wherein each document is assigned an arbitrary integer. Each docid is a single token in the T5 vocabulary and the decoder learns a probability distribution over the docid embeddings. However, it is difficult to apply such docids to a large-scale corpus since the size of the model embedding layer cannot be too large.
(2) String docid, wherein each document is assigned an arbitrary tokenizable numeric string. The decoder generates docids token-by-token in an autoregressive fashion. This removes the limitation on corpus size that comes with unstructured atomic docids.
(3) Semantic numeric docid, wherein a simple hierarchical clustering algorithm is applied over all the documents and each document is assigned an identifier composed of the numbers of its corresponding clusters.
The experimental results in [52] have also shown that the semantically structured docid performs better than the other two. However, all these integer-based docids have limited and implicit semantic meanings, which is not very consistent with human learning.
Retrieval: Given an input query $q_i$ in the query set $Q$, a DSI model returns a docid by autoregressively generating the docid string $id_i$ with the T5 model fine-tuned on indexing. The model is also trained with the standard T5 training objective,

$$\mathcal{L}_{Retrieval}(\theta) = \sum_{q_i \in Q} \log P(id_i \mid q_i; \theta),$$

where $id_i$ is the generated docid for $q_i$. A ranked list of potentially relevant docids can be easily obtained with beam search [36]. Tay et al. [52] propose two main strategies for training DSI models. The first is to fine-tune T5 to perform indexing, and then use the trained model for retrieval. The second is to fine-tune T5 to perform both indexing and retrieval together in a multi-task setup. In their experimental analysis, the second strategy performed significantly better. The multi-task learning objective is

$$\mathcal{L}(\theta) = \mathcal{L}_{Indexing}(\theta) + \mathcal{L}_{Retrieval}(\theta).$$

Once such a DSI model is learned, it can be used to retrieve candidate documents for a test query $q$ in an end-to-end manner,

$$w_t = \mathrm{DSI}_{\theta}(q, w_0, w_1, \ldots, w_{t-1}),$$

where $\mathrm{DSI}_{\theta}$ is the learned model, $w_t$ is the $t$-th token in the docid string, and the generation stops when a special EOS token is decoded. The generated string might not always be a valid docid if the model is allowed to generate any token from the vocabulary at every decoding step. Hence, a constrained beam search strategy [14] is employed to force each generated docid string to be in a predefined candidate set.

OUR APPROACH
In this section, we introduce the SE-DSI model, a novel semantic enhanced DSI method designed for ad-hoc retrieval.

Overview
Formally, suppose $\mathcal{D} = \{d_1, d_2, \ldots\}$ denotes a corpus, where $d_i$ is an individual document assigned a docid $id_i$. In DSI, docids are predicted using model parameters only. This resembles the way humans recall or retrieve information that was previously encoded and remembered in the brain [20,21,44]. Therefore, we introduce a novel Semantic-Enhanced DSI model (SE-DSI) to advance the original DSI, inspired by problem-solving strategies that some psychologists label Learning Strategies [25,46,56,57].
Basically, SE-DSI first constructs an Elaborative Description (ED) from each document as its docid, to represent it with explicit semantics (Section 3.2). Then, multiple coarse-to-fine contents at different granularities are selected from each document as Rehearsal Contents (RCs) (Section 3.3). In this way, we learn to build associations between original documents augmented with RCs and their corresponding EDs (Section 3.4). The overall architecture of SE-DSI is illustrated in Figure 2.

Elaborative description
Compared to designing an arbitrary integer or a string of integers as the docid of a document, a more natural way for humans is to describe the document in natural language. In Elaboration Strategies, it is well known that for many memory tasks, learning with semantic elaboration facilitates long-term memory and recall more than learning without semantic elaboration [7,10,13,26,31]. Semantic elaboration can be defined as the process of stating a to-be-remembered stimulus, e.g., a story or picture, in natural language that has semantic relationships with it, instead of using non-nameable stimuli with weak semantics [51]. This motivates us to construct EDs as the docids for documents.
Asking annotators to produce meaningful names for all documents in a large-scale corpus is time-consuming and requires increasingly sophisticated domain knowledge. To reduce the manual effort of writing elaborative identifiers from scratch, we propose to generate EDs with a query generation technique. Specifically, we leverage the off-the-shelf DocT5query model [42] to generate pseudo queries as the docids, which are likely to be representative of or related to the contents of the documents. For each document $d_i$ in the given corpus $\mathcal{D}$, we directly feed it to the DocT5query model to generate a set of representative queries with a random sampling strategy. By analyzing the two retrieval datasets used in this study, we find that concatenating more generated queries as the docid for generation leads to degraded retrieval performance. A possible reason is that the concatenated text is considerably longer than a single query, and a generative model is prone to hallucinate unintended content, especially when the target sequence gets longer [15,27].
In this work, we leverage the top-1 generated query as the ED for each document $d_i$, i.e., $ED_i$. Unfortunately, according to the experimental results, about 5% and 3% of the documents have non-unique EDs in MS MARCO and NQ, respectively. It is reasonable that different documents may share the same ED if they share very similar essential information, which is similar to human learning: humans prefer to remember semantically similar documents with the same name. Following [12], we ignore the ED repetition problem in the training phase. In the inference phase, since most queries in both datasets have only one ground-truth relevant document, we propose to solve the repetition problem in a simple way. First, we leverage beam search to generate a ranked ED list. Then, we obtain the documents corresponding to these EDs to form the final ranked document list. If an ED corresponds to multiple documents, we return all of them in a random order, while keeping the relative order of documents corresponding to other EDs.
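The inference-time handling of repeated EDs described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the docid strings, and the example EDs are hypothetical, and the ranked ED list is assumed to come from beam search.

```python
import random
from collections import defaultdict


def build_ed_index(doc_to_ed):
    """Invert a {doc_id: ED} mapping into {ED: [doc_id, ...]}."""
    ed_to_docs = defaultdict(list)
    for doc_id, ed in doc_to_ed.items():
        ed_to_docs[ed].append(doc_id)
    return ed_to_docs


def expand_ranked_eds(ranked_eds, ed_to_docs, rng=None):
    """Turn a ranked ED list into a ranked document list.

    Documents sharing one ED are emitted together in random order; the
    relative order of documents belonging to different EDs is preserved.
    """
    rng = rng or random.Random()
    ranked_docs = []
    for ed in ranked_eds:
        docs = list(ed_to_docs.get(ed, []))
        rng.shuffle(docs)  # ties inside one ED are broken randomly
        ranked_docs.extend(docs)
    return ranked_docs


# Hypothetical toy corpus: D1 and D2 share the same ED.
doc_to_ed = {
    "D1": "average cost of disneyland",
    "D2": "average cost of disneyland",
    "D3": "what is bm25",
}
index = build_ed_index(doc_to_ed)
ranking = expand_ranked_eds(
    ["what is bm25", "average cost of disneyland"], index, random.Random(0)
)
```

Because the shuffle only touches documents inside one ED group, a duplicated ED can change the rank of its own documents by at most the group size.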

Rehearsal contents
To help ensure information goes from short-term memory to long-term memory, a very useful rehearsal strategy is to selectively underline or highlight multiple important parts when reading a new text [46]. This helps people reduce lengthy text to a comprehensible and manageable size that is central to understanding the piece and easy to memorize. Inspired by this learning strategy, we propose to select multiple important parts of a document as RCs to shorten the original document. The original documents augmented with RCs are then used to memorize the original document. Specifically, the RCs should fulfill the following conditions:
Informative: The RCs should contain the important information of the original document, enabling the model to learn to comprehend and encode the document into its parameters.
Fluency: The RCs should be fluent and readable, so that the model can acquire the text encoding ability.
Diversity: The RCs should contain semantic units of different granularities (e.g., sentence-level and passage-level), so as to achieve elaboration of the document for storage enhancement.
Figure 2 (caption fragment): ... passage-level and sentence-level information) with the corresponding docid, respectively. In the retrieval phase, the docids are generated from the query, and a ranked list of potentially relevant documents is returned via beam search.
To achieve these goals, we propose to generate coarse-to-fine RCs at different granularities from each original document to rehearse it. Given a document, we select the important language units, i.e., passages and sentences, to condense it into RCs. Specifically, we tailor two data augmentation methods to generate RCs:
Leading-style. We first introduce a simple but effective data augmentation method. It is based on a simple fact: writers are likely to state major points at the beginning of a document, and readers prefer to read the beginning part first. This leads to an intuitive idea: we can directly use the leading passages and sentences of each original document as its RCs. Specifically, for each document, we directly use a fixed number of leading passages and leading sentences as the passage- and sentence-level RCs, respectively.
Summarization-style. We propose to incorporate the important information from both the local context (e.g., sentence-level) and the broader context (e.g., paragraph-level). We leverage the document summarization technique to highlight multiple important parts that reveal the essential topics of the document. We adopt a widely used assumption, namely that a part of a document is important if it is highly related to many other important parts [60]. We leverage a representative graph-based extractive summarization model, TextRank [38], which uses co-occurrence information between words in the document to measure the importance of each part based on the PageRank [30] algorithm. Specifically, for each document, we extract a fixed number of important passages and important sentences as the passage- and sentence-level RCs, respectively.
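To make the graph-based selection concrete, the following is a simplified, dependency-free sketch in the spirit of TextRank: text units are nodes, edge weights are a word-overlap similarity, and a weighted PageRank iteration scores each unit. It is a stand-in under stated assumptions rather than the model used in the paper; the tokenizer, the damping factor, the iteration count, and all function names are illustrative choices.

```python
import math
import re


def _tokens(text):
    """Lowercased word set of one text unit (a sentence or a passage)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _similarity(a, b):
    """Word overlap normalised by log lengths; 0 for very short units."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_units(units, damping=0.85, iterations=50):
    """Score units with weighted PageRank over the similarity graph."""
    n = len(units)
    if n == 0:
        return []
    toks = [_tokens(u) for u in units]
    weights = [
        [_similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_k_units(units, k):
    """Return the k highest-scoring units, kept in document order."""
    scores = rank_units(units)
    best = sorted(range(len(units)), key=lambda i: scores[i], reverse=True)[:k]
    return [units[i] for i in sorted(best)]


# Hypothetical document: the last sentence is unrelated to the others.
sentences = [
    "Disneyland ticket prices vary by season.",
    "The average cost of a Disneyland ticket depends on the season.",
    "Bananas are yellow.",
]
selected = top_k_units(sentences, 2)
```

The same routine can be applied to passages instead of sentences to obtain passage-level RCs; only the input units change.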
Afterward, we obtain a set of passage-level and sentence-level RCs (denoted as $P_i$ and $S_i$, respectively) for each document $d_i \in \mathcal{D}$. The original document $d_i$, rehearsed by its RCs, can then be paired with the $ED_i$ of $d_i$ as training instances to learn the mapping between a document and its ED. Each RC shares the ED of the original document, which helps enhance the memorization of the document from multiple perspectives.
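As an illustration of this pairing, the sketch below builds the indexing instances for one document: the document itself and every RC become separate source texts that all share the same ED target. The function name, the example strings, and the exact form of the prefix are assumptions; the paper's training section only states that a "Document" prefix is added to these inputs.

```python
def build_indexing_instances(doc_text, passage_rcs, sentence_rcs, ed,
                             prefix="Document: "):
    """Pair the document and each of its RCs with the same ED target.

    Returns a list of (source, target) tuples for seq2seq training.
    The prefix format is an assumption made for this sketch.
    """
    sources = [doc_text, *passage_rcs, *sentence_rcs]
    return [(prefix + source, ed) for source in sources]


# Hypothetical example with one passage-level and two sentence-level RCs.
instances = build_indexing_instances(
    doc_text="Full text of the document ...",
    passage_rcs=["A key passage."],
    sentence_rcs=["First key sentence.", "Second key sentence."],
    ed="average cost of disneyland",
)
```

Each document therefore contributes one instance plus one per RC, so the number of rehearsals per document grows with the number of selected passages and sentences.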

Training and inference
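In the training phase, given a corpus $\mathcal{D}$, the sets of pairs $\{p, ED_i\}$, $\{s, ED_i\}$ and $\{d_i, ED_i\}$ for each document $d_i \in \mathcal{D}$, where $p \in P_i$ and $s \in S_i$ range over its passage- and sentence-level RCs, and the labeled query-ED pairs $\{q_i, ED_i\}$ for each query $q_i \in Q$, we follow the multi-task learning strategy of the original DSI model, i.e.,

$$\mathcal{L}(\theta) = \sum_{d_i \in \mathcal{D}} \Big[ \log P(ED_i \mid d_i; \theta) + \sum_{p \in P_i} \log P(ED_i \mid p; \theta) + \sum_{s \in S_i} \log P(ED_i \mid s; \theta) \Big] + \sum_{q_i \in Q} \log P(ED_i \mid q_i; \theta),$$

where $\theta$ denotes the parameters of our SE-DSI model. To specify which task the model should perform (i.e., indexing or retrieval), we add a task-specific prefix "Query" to the input query $q_i$, and "Document" to the passage-level RCs, the sentence-level RCs and $d_i$, before feeding them to the model. In the inference phase, to ensure that the decoded ED is valid, we employ a constrained beam search strategy [36] to force each generated string to be in a predefined candidate set, i.e., the EDs of all the documents in $\mathcal{D}$. Specifically, we define our constraint in terms of a prefix tree whose nodes are annotated with tokens from the predefined candidate set.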
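A prefix tree of this kind can be sketched as a nested dictionary: at each decoding step, the children of the node reached by the tokens generated so far are the only tokens the decoder may emit next. The class below is a minimal illustration with whitespace tokens and an explicit "</s>" end marker, both of which are simplifying assumptions; a real system would store tokenizer ids.

```python
class PrefixTree:
    """Trie over tokenised EDs; a node's children are the legal next tokens."""

    def __init__(self, sequences=()):
        self.root = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, tokens):
        """Insert one tokenised ED (including its end-of-sequence token)."""
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def allowed_next(self, prefix):
        """Tokens that keep the prefix inside the candidate set."""
        node = self.root
        for token in prefix:
            node = node.get(token)
            if node is None:  # prefix has already left the candidate set
                return []
        return sorted(node)


# Hypothetical candidate set of two EDs, whitespace-tokenised for clarity.
eds = ["average cost of disneyland", "average price of gas"]
tree = PrefixTree(ed.split() + ["</s>"] for ed in eds)
step_0 = tree.allowed_next([])
step_1 = tree.allowed_next(["average"])
```

During constrained beam search, all tokens outside `allowed_next(prefix)` would be masked out before the top candidates are selected. Libraries such as Hugging Face Transformers expose a hook for this kind of constraint (the `prefix_allowed_tokens_fn` argument of `generate`); whether the authors used it is not stated.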

OFFLINE EXPERIMENTAL SETTINGS

Datasets
Following [52,53,63], we conduct offline experiments on two publicly available retrieval datasets:
(1) The MS MARCO Document Ranking dataset (MS MARCO) [40] is a large-scale benchmark dataset for web document retrieval. Following [52], to evaluate how models perform at different scales, we construct three sets from MS MARCO to form our testbed, namely MS MARCO 10K, MS MARCO 100K and MS MARCO Full. For MS MARCO 10K, we randomly sample 14,763 and 1,330 query-document pairs from the training set and dev set, respectively. Similarly, for MS MARCO 100K, we randomly sample query-document pairs from the training set and dev set, respectively. Besides, we refer to the original dataset, with about 3.21M documents, as MS MARCO Full.
(2) Natural Questions (NQ) [29] contains 307K query-document pairs, where the queries are natural language questions and the documents are gathered from Wikipedia pages. Following [52], we randomly sample

Evaluation metrics
Following the original DSI model [52] and some follow-up studies [53,63], we take Hit ratio (Hits@N) and Mean Reciprocal Rank (MRR@N) as the evaluation metrics. Hits@N is the proportion of queries for which the correct document is ranked within the top-N list, where N = {1, 10}. MRR@N averages, over queries, the reciprocal of the rank of the first relevant document retrieved within the top N, where N = {3, 20}.
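For reference, both metrics can be computed as in the sketch below. It assumes a single ground-truth document per query, which matches the setting described for these datasets; the function names and the toy rankings are illustrative.

```python
def hits_at_n(ranked_lists, gold_docs, n):
    """Fraction of queries whose gold document appears in the top n."""
    hits = sum(
        1 for ranking, gold in zip(ranked_lists, gold_docs) if gold in ranking[:n]
    )
    return hits / len(ranked_lists)


def mrr_at_n(ranked_lists, gold_docs, n):
    """Mean reciprocal rank of the gold document, counted only within the top n."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_docs):
        for rank, doc_id in enumerate(ranking[:n], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break  # only the first occurrence counts
    return total / len(ranked_lists)


# Two toy queries: the first gold document is at rank 2, the second is missed.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
gold = ["b", "q"]
hits_1 = hits_at_n(rankings, gold, 1)
hits_3 = hits_at_n(rankings, gold, 3)
mrr_3 = mrr_at_n(rankings, gold, 3)
```

In this toy case Hits@1 is 0, Hits@3 is 0.5, and MRR@3 is 0.25, since only the first query contributes a reciprocal rank of 1/2.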

Models
Traditional document retrieval methods. We consider two representative methods, covering sparse retrieval and dense retrieval. (i) BM25 [47] is a term-based sparse retrieval method. We implement it with the Anserini open-source toolkit [4]. (ii) RepBERT [59] is a BERT-based two-tower model trained with in-batch negative sampling. We implement it with the released code. We sample 1 negative sample for each positive sample. The batch size is 30 and the learning rate is 1e-5. The maximum input lengths of the document and the query are 512 and 20, respectively.
DSI methods. We also apply several existing DSI methods. Among the docids described in Section 2, we consider the unique arbitrary string and the semantic numeric string. Since the single-token (atomic) docid is reported in [52] to perform worse than these two, we omit it. For the indexing strategy, we choose two effective methods reported in [52,53,63], namely learning (document, docid) pairs and learning (pseudo query, docid) pairs. For the implementation of DSI methods, we use the same settings as for our SE-DSI model. (i) DSI-ARB takes the original documents as input and outputs the corresponding unique ARBitrary string docids of [52]. (ii) DSI-SEM takes the original documents as input and outputs the corresponding SEMantic numeric string docids of [52]. (iii) DSI-QG takes as input a set of pseudo Queries Generated from the original documents with a query generation model, and outputs semantic numeric docids. It can be viewed as an adaptation of [53,63].
Model variants. We refer to our SE-DSI model with the leading-style and summarization-style augmentation methods as SE-DSI$_{Lead}$ and SE-DSI$_{Sum}$, respectively. We also implement two variants of SE-DSI, namely SE-DSI$_{Doc}$ and SE-DSI$_{Random}$. SE-DSI$_{Doc}$ takes the original document as input and outputs its ED. SE-DSI$_{Random}$ obtains RCs by randomly sampling several passages and sentences from the document, where the numbers of passages and sentences follow the leading-style augmentation method.

Implementation Details
Elaborative Description. For MS MARCO, we use the released pseudo queries generated by docT5query [42] as EDs. For NQ, following [53], we directly leverage the docT5query model to generate 10 queries for each document. The maximum length of a pseudo query is below 20 for both MS MARCO and NQ.
Rehearsal Contents. We first split each document into sentences with spaCy's sentencizer [6]. Following [42], we regard 5 successive sentences as one passage and skip two sentences to obtain the next passage. Iterating in this way, we obtain a sequence of passages. According to our statistics, the percentage of documents with fewer than 3 passages is 3% in MS MARCO and 4% in NQ. For the leading-style augmentation method, we set the number of leading passages and the number of leading sentences to 3 and 6, respectively. Note that for a document with fewer than 3 passages, we use only 1 leading passage, while for a document with fewer than 6 sentences, we use all of its sentences. For the summarization-style augmentation method, we set the number of important passages and the number of important sentences to 1 and 6, respectively. Specifically, we leverage the summa API [3] to implement the TextRank model.
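The passage construction and the leading-style selection can be sketched as follows. The sketch takes already-split sentences as input and interprets "skip two sentences" as advancing the window start by two sentences; that interpretation, the function names, and the handling of leftover sentences are assumptions, not details confirmed by the text.

```python
def split_passages(sentences, window=5, stride=2):
    """Slide a `window`-sentence passage over the document, advancing by `stride`.

    Documents no longer than one window become a single passage. Trailing
    sentences that do not fill a whole window are dropped in this sketch.
    """
    if not sentences:
        return []
    if len(sentences) <= window:
        return [" ".join(sentences)]
    return [
        " ".join(sentences[start:start + window])
        for start in range(0, len(sentences) - window + 1, stride)
    ]


def leading_rcs(sentences, num_passages=3, num_sentences=6):
    """Leading-style RCs: the first passages and the first sentences."""
    passages = split_passages(sentences)
    if len(passages) < num_passages:
        passages = passages[:1]  # short documents contribute one passage
    else:
        passages = passages[:num_passages]
    # Slicing returns all sentences when the document has fewer than requested.
    return passages, sentences[:num_sentences]


# Hypothetical ten-sentence document.
doc_sentences = [f"s{i}." for i in range(10)]
passage_rcs, sentence_rcs = leading_rcs(doc_sentences)
```

With ten sentences, the windows start at sentences 0, 2 and 4, so the document yields exactly three passages and all three are kept as passage-level RCs.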
Training and Inference. Since the authors of [52] have not released the original code, we implement and train our model and the existing DSI models ourselves. We employ a Transformer-based encoder-decoder architecture as our model, where the hidden size is 768, the feed-forward layer size is 3072, the number of self-attention heads is 12, and the number of Transformer layers is 12. We initialize the parameters of our model with T5-base (0.2B) [5]. Note that the existing DSI methods are also based on T5-base. We use the Adam optimizer with a linear warm-up over the first 10% of the steps. The learning rate is set to 5e-5, the label smoothing to 0.1, the weight decay to 0.01, the sequence length to 512, the maximum number of training steps to 50K, and the batch size to 30. We train our model on four NVIDIA Tesla A100 40GB GPUs. At inference time, we adopt constrained beam search to decode the ED with 20 beams.

Main results
The comparison between our SE-DSI and baselines on MS MARCO and NQ 100K datasets is shown in Table 2 and Table 3.
Performance of sparse retrieval and dense retrieval methods: (1) BM25 is a strong baseline that performs well on most datasets. By automatically learning text representations and semantic relationships between queries and documents, RepBERT achieves better results than BM25. (2) The performance gap gets larger as the size of the dataset increases. The reason might be that dense retrieval methods benefit from being trained with more data. In contrast, the performance of BM25 does not change regularly with the size of the dataset.
Performance of DSI baselines: (1) DSI-ARB and DSI-SEM perform better than BM25 on NQ 100K, which is consistent with the results of the original model [52]. However, in accordance with some follow-up studies [63], DSI-ARB and DSI-SEM perform worse than the sparse retrieval and dense retrieval baselines by a large margin on MS MARCO. The reason might be that it is hard for the model to learn associations between documents and integer-based string identifiers with limited semantic information. This again
Table 2: Experimental results on the MS MARCO dataset. *, † and ‡ indicate statistically significant improvements over the best performing generative retrieval baseline DSI-QG, BM25, and RepBERT, respectively ($p \leq 0.05$).
(2) The performance improvements of DSI-SEM over DSI-ARB indicate that imbuing the target space with semantic structure can facilitate greater ease of optimization [52].
(3) The performance improvements of DSI-QG over DSI-SEM show that bridging the gap in input data between indexing and retrieval helps the model better learn the association between query and docid. However, documents usually contain rich semantics, and it may not be optimal to encode only pseudo queries and ignore the documents.
Performance of our SE-DSI: (1) SE-DSI$_{Random}$ performs significantly better than SE-DSI$_{Doc}$ on all the datasets. Besides the original document, SE-DSI$_{Random}$ also introduces randomly sampled passages and sentences, which does help document memorization. This result demonstrates that the corpus encoding process in DSI is, to a certain extent, similar to the rehearsal strategy.
(2) SE-DSI outperforms the baseline methods in terms of almost all the metrics, showing that employing RCs and EDs, which simulate the human learning process, contributes to better indexing and retrieval. (3) Our method performs worse than RepBERT on MS MARCO Full and NQ 100K in terms of Hits@10. The reason might be that RepBERT leverages a pair-wise loss that considers the relationship between a positive and a negative document, while SE-DSI directly learns the query-ED relationship (although this helps it perform the best in terms of Hits@1). (4) Among the two of
Memory and inference efficiency: SE-DSI significantly reduces the memory footprint and the inference time of document retrieval compared to dense retrieval models. (i) The memory consumption of SE-DSI mainly comes from a prefix tree of the document identifiers and the model parameters, as opposed to a large document index and a dense vector for each document in dense retrieval. For example, the memory footprint of our model is reduced by about 31 times compared to RepBERT. (ii) The heavy retrieval process, i.e., the time-consuming step of searching over a large-scale corpus, is replaced with a light generative process over the prefix tree. For example, the inference speed of SE-DSI is improved by about 2.5 times compared to RepBERT. The other variants of SE-DSI show the same trend.

Analysis on elaborative description
In this section, we compare the proposed EDs to existing integer-based docids. As shown in Table 2 and Table 3, SE-DSI$_{Doc}$ performs better than DSI-ARB and DSI-SEM on both MS MARCO and NQ 100K. These results indicate the effectiveness of representing a document with our proposed ED as the docid, which is a natural language text with enhanced semantic meaning.
Case. We conduct case studies to see how EDs as docids affect performance. Specifically, we take one example from the MS MARCO dataset. As shown in Table 4, we can see that, given the same query, SE-DSI$_{Doc}$ ranks the ground-truth document at the 4th position, while DSI-SEM cannot rank it in the top 5 (it is actually ranked 10th). The reason is that the semantic numeric docid, i.e., "63260", hardly reflects the semantics of the document, while the ED as the docid, i.e., "Average cost of Disneyland", is more representative of the document.
Table 5: Experimental results of zero-shot retrieval settings on MS MARCO 100K and NQ 100K. * indicates statistically significant improvements over the best performing baseline DSI-QG ($p \leq 0.05$). Columns: Methods; MRR@3, MRR@20, Hits@1 and Hits@10 on MS MARCO 100K; MRR@3, MRR@20, Hits@1 and Hits@10 on NQ 100K.

Analysis on rehearsal contents
Here, we analyze on MS MARCO 100K whether RCs help document memorization compared to the existing method, which only takes the original document as input. Specifically, we first feed SE-DSI with only the documents, only the sentences, and only the passages, respectively. Then, we feed SE-DSI with the mixture of documents and sentences, and with the mixture of documents and passages, respectively. Here, we obtain the sentences and passages via the summarization-style method.
As shown in Table 6, we can see that: (1) Rehearsing the original documents with two granularities, i.e., w/Doc+Sent and w/Doc+Psg, outperforms using only one granularity, i.e., w/Doc, w/Psg and w/Sent. This indicates that it is insufficient to encode the document content at a single granularity. (2) The better results of w/Sent over w/Psg suggest that reducing the gap in input format between indexing and retrieval contributes to the final performance. However, neither of them outperforms w/Doc, due to the loss of the rich semantics in documents. (3) SE-DSI achieves the best results, again indicating that learning with the underlined important contents of the documents can comprehensively encode the documents and further contribute to retrieval.
Case. We also conduct some case studies to better understand how RCs affect the performance. We take the document (D3240834) in Table 4 as an example, and show the predicted EDs from SE-DSI with RCs and from SE-DSI$_{Doc}$, which encode the documents in different ways, i.e., with RCs and with original documents only, respectively. As shown in Table 7, we can observe that, given the query, the two models rank the ground truth at the 1st and 4th positions, respectively. This result shows that augmenting key information does help the model memorize documents and distinguish similar ones.

Zero-shot setting
We further conduct zero-shot retrieval on MS MARCO 100K and NQ 100K. For a fair comparison, we only compare our model with existing DSI methods. Specifically, zero-shot retrieval is carried out by performing only the indexing task without the retrieval task [52], i.e., the ground-truth query-document pairs are not provided in the training phase. As shown in Table 5, we can observe that: (1) DSI-QG slightly outperforms SE-DSI on NQ 100K. This is probably because DSI-QG takes pseudo queries as input during indexing, which are similar to the input data in retrieval. (2) SE-DSI significantly outperforms DSI-QG on the MS MARCO 100K dataset in terms of MRR@3 (0.4472 vs. 0.2668). These results further validate that EDs and RCs help the model encode the information about the corpus into its parameters, so that SE-DSI works like a human with a knowledgeable brain.

ONLINE EXPERIMENTS
Beyond the offline experiments, we conduct an online evaluation on a popular Chinese search engine, i.e., Baidu search engine.

Task definition
In practice, users may specify their information needs through a query for official sites. Official sites are defined as Web pages that are operated by universities, departments, or other administrative units. The definition does not apply to websites operated by individuals, such as students or faculty. For example, given the query "北京协和医院 (Peking Union Medical College HOSP)", the user tends to look for its official site, corresponding to the site URL "www.pumch.cn". Such an authority-sensitive retrieval scenario requires high reliability and authority. Therefore, Baidu search sets up the site retrieval task, which is used to understand query intents on official sites and to further guide the search engine to recall relevant official sites. Since the total size of the official site URL set is moderate, and its update frequency is lower than in other retrieval scenarios, the DSI paradigm is well suited to official site retrieval.

Datasets and evaluation metrics
Datasets. The official site attributes are as follows. (i) Site URL is an address for a site. (ii) Site name is a descriptive name that appears in the Internet Information Services management interface. (iii) Site domain is the identity of one or more site addresses. (iv) ICP record is a registration name filed with the Chinese Ministry of Industry and Information Technology (MIIT). (v) Web page is a hypertext document on the World Wide Web. For example, for the site URL "www.pumch.cn", the site name is "北京协和医院 (Peking Union Medical College Hospital)", the domain is "pumch.cn", and the ICP record is "中国医学科学院北京协和医院 (Chinese Academy of Medical Sciences and Peking Union Medical College)". All data are collected from real search logs.
Evaluation metrics. Since the goal is to capture the positives in the top-k results, we take Recall@k as the evaluation metric, where k={3,20}. Specifically, we consider two evaluation settings for Recall: (i) Site-level Recall@k: the predicted site URL is completely consistent with the ground-truth site URL. (ii) Domain-level Recall@k: the predicted site URL and the ground-truth site URL are in the same site domain. For example, given the ground-truth site URL "www.pumch.cn", if the predicted URL is "www.pumch.cn", it is correct on both levels. If the predicted site URL is "jobs.pumch.cn", it is wrong at the site level but correct at the domain level. We report the relative Recall (ΔRecall@k), which is the difference between the proposed method SE-DSI and the baseline. ΔRecall@k > 0 means SE-DSI is better than the baseline.
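The two Recall settings can be illustrated with a short sketch. It assumes one ground-truth site URL per query and, as a deliberate simplification, treats the last two dot-separated labels of a host name as the site domain. The helper names and the "www.example.cn" URL are hypothetical.

```python
def site_domain(url):
    # Simplifying assumption: the domain is the last two labels of the host,
    # e.g. "jobs.pumch.cn" -> "pumch.cn".
    host = url.split("/")[0].lower()
    return ".".join(host.split(".")[-2:])


def recall_at_k(predictions, truths, k, level="site"):
    """Fraction of queries whose ground truth is matched within the top-k URLs."""
    hits = 0
    for ranked, truth in zip(predictions, truths):
        top_k = ranked[:k]
        if level == "site":
            hit = truth in top_k  # exact URL match
        else:
            hit = site_domain(truth) in {site_domain(u) for u in top_k}
        hits += int(hit)
    return hits / len(truths)


def delta_recall(model_preds, baseline_preds, truths, k, level="site"):
    # Positive values mean the model beats the baseline.
    return (recall_at_k(model_preds, truths, k, level)
            - recall_at_k(baseline_preds, truths, k, level))


toy_preds = [["jobs.pumch.cn", "www.example.cn"]]
toy_truth = ["www.pumch.cn"]
site_recall = recall_at_k(toy_preds, toy_truth, k=3, level="site")
domain_recall = recall_at_k(toy_preds, toy_truth, k=3, level="domain")
```

In the toy case the prediction "jobs.pumch.cn" misses at the site level but counts at the domain level, mirroring the example in the text.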

Baselines
There are two dense retrieval methods previously used in Baidu: (i) DualEnc is an Ernie-based [50] dual-tower architecture model. It learns a query encoder and a site encoder from (query, site attributes) pairs, where the site attributes are the site name, ICP record, and web page contents. (ii) SingleTow is a single-tower method consisting of an Ernie-based encoder and a feed-forward layer whose weights are initialized with the site representations learned by DualEnc. During training, it takes the query as input, and the output logits of the feed-forward layer are passed through a softmax function to generate a probability distribution over sites. The probability of each site serves as its relevance score. During inference, DualEnc needs both queries and site attributes as input, while SingleTow only needs queries.
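The difference between the two baselines' scoring steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the production models: the encoders are replaced by fixed toy vectors, and a dot product is assumed as the DualEnc similarity, which the text does not specify.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_encoder_scores(query_vec, site_vecs):
    """DualEnc-style scoring: similarity between separately encoded query and sites."""
    return [dot(query_vec, s) for s in site_vecs]


def single_tower_scores(query_vec, weight_rows, bias=None):
    """SingleTow-style scoring: feed-forward logits followed by a softmax over sites."""
    bias = bias or [0.0] * len(weight_rows)
    logits = [dot(query_vec, w) + b for w, b in zip(weight_rows, bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy site representations standing in for the output of a site encoder.
site_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_vec = [2.0, 0.0]

# The feed-forward weights start from the site representations.
weights = [row[:] for row in site_vecs]
probs = single_tower_scores(query_vec, weights)
similarities = dual_encoder_scores(query_vec, site_vecs)
```

At initialization both scorers rank the sites identically. They diverge once the feed-forward weights are trained, since the single tower no longer needs the site attributes at inference time.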

Implementation details
For model architecture, our SE-DSI is initialized with Ernie-GEN [58], an enhanced multi-flow seq2seq pre-training and fine-tuning framework. For the elaborative description, since some sites are not associated with web pages in practice, we directly use the unique site URLs as the docids. For rehearsal contents, we use the leading passages and sentences of each web page for the leading-style augmentation, where the numbers of leading passages and sentences are 2 and 6, respectively. For the summarization-style augmentation method in RCs, we extract important passages and sentences from each web page and set the numbers of important passages and sentences to 1 and 6, respectively. Specifically, we leverage textrank4zh [1] to implement TextRank for the Chinese language.
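The leading-style augmentation can be sketched as follows. The sketch assumes that passages are separated by blank lines and that sentences end with Chinese or Western terminal punctuation. Both splitting rules and all function names are illustrative assumptions, since the text does not describe how web pages are segmented.

```python
import re


def split_passages(page_text):
    # Assumption: passages are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def split_sentences(passage):
    # Assumption: a sentence ends at Chinese or Western terminal punctuation.
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", passage)
    return [s.strip() for s in parts if s.strip()]


def leading_rehearsal_contents(page_text, num_passages=2, num_sentences=6):
    """Return the leading passages and leading sentences of a web page."""
    passages = split_passages(page_text)
    sentences = [s for p in passages for s in split_sentences(p)]
    return passages[:num_passages], sentences[:num_sentences]


toy_page = "S1. S2.\n\nS3. S4.\n\nS5. S6. S7."
lead_passages, lead_sentences = leading_rehearsal_contents(toy_page)
```

The defaults of 2 passages and 6 sentences follow the settings reported above. The summarization-style variant would replace the leading slices with a TextRank-based selection.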
To learn the associations between the site attributes and site URLs, if a site has all site attributes, we train one SE-DSI variant with (site name, site URL) pairs, (ICP record, site URL) pairs, and (web page contents, site URL) pairs. For the other two SE-DSI variants, we replace the (web page contents, site URL) pairs with (RCs, site URL) pairs. To map each query to its relevant site URL, we train the SE-DSI models with (query, site URL) pairs. All experiments are conducted on the Baidu PaddleCloud platform [2]. During inference, SE-DSI uses the prefix tree of sites to decode the ED with 5 beams.
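Decoding with a prefix tree restricts the beam search to identifiers that exist in the site set. The following minimal sketch shows the idea. The scoring function `toy_log_prob` is a hypothetical stand-in for the seq2seq decoder, and splitting URLs on dots is an illustrative tokenization rather than the one used in the system.

```python
END = "<eos>"  # end-of-identifier marker (illustrative)


def build_prefix_tree(token_sequences):
    """Nested-dict trie over the token sequences of all valid identifiers."""
    root = {}
    for seq in token_sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_beam_search(prefix_tree, log_prob_fn, num_beams=5, max_len=32):
    """Beam search that only expands tokens allowed by the prefix tree."""
    beams = [((), 0.0, prefix_tree)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, node in beams:
            for tok, child in node.items():  # only valid continuations
                new_score = score + log_prob_fn(tokens, tok)
                if tok == END:
                    finished.append((tokens, new_score))
                else:
                    candidates.append((tokens + (tok,), new_score, child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished[:num_beams]


def toy_log_prob(prefix, token):
    # Hypothetical decoder: favours the tokens of "www.pumch.cn".
    return -0.1 if token in {"www", "pumch", "cn", END} else -2.0


sites = ["www.pumch.cn", "jobs.pumch.cn", "www.example.org"]
tree = build_prefix_tree([s.split(".") for s in sites])
results = constrained_beam_search(tree, toy_log_prob, num_beams=5)
ranked_sites = [".".join(tokens) for tokens, _ in results]
```

Because every expansion follows an edge of the tree, each returned sequence is guaranteed to be an existing site identifier, regardless of what the decoder would otherwise prefer.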

Online A/B experimental results
As shown in Table 8, SE-DSI in general significantly outperforms DualEnc and SingleTow in terms of all metrics. The reasons might be as follows. (1) DualEnc optimizes the model by directly matching the query and the site attributes. It therefore needs high-quality site attributes to train the site encoder. However, many sites lack attributes, and web pages usually contain noisy information, which may hurt performance. (2) SingleTow works better than DualEnc by a large margin. The reason may be that site attributes are encoded into the model in the form of a matrix, contributing to better interaction with the query. (3) For SE-DSI, the site representation is in the form of model parameters, making the query interact with global information, which is more flexible and deeper than explicit similarity functions. (4) The two SE-DSI variants trained with RCs work better than the variant trained with the full web page contents, which shows that learning with important contents of the web pages facilitates the process of encoding the corpus and further contributes to retrieval.
Case. We conduct case studies to analyze the difference between SE-DSI and the baselines. Specifically, we take one example, shown in Table 9. We can see that, given the same query, DualEnc cannot rank the ground-truth site URL in the top 3. SingleTow ranks the ground truth 3rd, while our SE-DSI ranks it 1st.
Side-by-side comparison. We also conduct a side-by-side comparison between SingleTow and the combination of SE-DSI and SingleTow in terms of overall satisfaction and high-quality authority. Human experts judge whether the combination method or SingleTow gives better final results. Here, the relative gain is measured with Good vs. Same vs. Bad (GSB) as ΔGSB = (#Good − #Bad) / (#Good + #Same + #Bad), where #Good (or #Bad) indicates the number of queries for which the combination method provides better (or worse) final results. As shown in Table 10, the combination method achieves significant positive gains in both aspects.
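As a small worked example of the side-by-side metric, the sketch below assumes the commonly used definition ΔGSB = (#Good − #Bad) / (#Good + #Same + #Bad). The judgment counts are made up for illustration and are not the counts behind Table 10.

```python
def delta_gsb(num_good, num_same, num_bad):
    """Relative gain from Good/Same/Bad judgments (assumed standard definition)."""
    total = num_good + num_same + num_bad
    if total == 0:
        raise ValueError("no judged queries")
    return (num_good - num_bad) / total


# Hypothetical counts: 30 better, 60 unchanged, 10 worse.
toy_gain = delta_gsb(num_good=30, num_same=60, num_bad=10)
```

Queries judged "Same" do not move the numerator but still enlarge the denominator, so a small ΔGSB can coexist with many untouched queries.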
Inference speed. We analyzed the end-to-end inference time of the retrieval phase: (i) Compared to DualEnc, SE-DSI, whose running time is proportional to the beam size, is about 2.5 times faster. (ii) The running speed of SE-DSI is about the same as that of SingleTow, which classifies sites with one softmax operation. (iii) In general, the running speed of SE-DSI can meet the requirements of industrial applications.

RELATED WORK
Sparse retrieval. The key idea of sparse retrieval methods is to utilize exact matching signals to design a relevance scoring function. Specifically, these models consider easily computed statistics (e.g., term frequency, document length, and inverse document frequency) of normalized terms matched exactly between the query and document. Among these models, BM25 [47] is shown to be effective and is still regarded as a strong baseline for many retrieval models. To enhance the semantic relationships, several works utilize word embeddings as term weights [22,62].
Dense retrieval. To solve the vocabulary mismatch problem in sparse retrieval [23,61], many researchers turn to dense retrieval models [33,59], which first learn dense representations of both queries and documents and then employ approximate nearest neighbor search [9,11] for retrieval. Further, pre-trained models are used to enhance dense retrieval [28,41].
Differentiable search index. The Differentiable Search Index (DSI) [52], which retrieves documents by generating their docids with a generative model, is gaining increasing attention. It presents an end-to-end solution for document retrieval tasks and allows for better exploitation of the capabilities of pre-trained generative models.
For the docids, the original DSI proposed that the docid could be represented by a single token (atomic integers) or a string of tokens, which can be an arbitrary string or a semantic numeric string [52]. Some later works followed this way of defining the docids [53,63]. Though the semantic numeric docid enables semantically similar documents to share prefixes, it reflects the semantic meaning of the document only implicitly and insufficiently. It is therefore suboptimal for mapping docids into a suitable semantic space. To further enrich the semantic information, researchers proposed to leverage Wikipedia page titles [14,17,18] as the docids for Wikipedia-based tasks. However, such methods depend on certain special document metadata. To mitigate this limitation, some works proposed leveraging all n-grams in a passage as its possible docids [12,16]. But it is costly to enumerate all occurrences of n-grams in the corpus. Here, we propose to construct EDs from documents to represent them, containing sufficient semantic information.
For the associations between documents and docids, the original DSI model proposed to take document tokens as input and generate docids as output [52]. Though this is simple and effective, long documents might be hard for the model to capture, resulting in poor performance. Later, some researchers proposed to use only multiple short pseudo queries generated from the documents as the input [53,63] and to pair them with the semantic numeric string [52]. However, encoding only pseudo queries may lose some essential information. In contrast, we propose to select multiple important parts of the document, jointly with the original document, to improve document memorization.

Table 10 data:
Aspect | ΔGSB
Overall satisfaction | +2.99%
High-quality and authority | +11.52%

CONCLUSION
In this work, we pointed out that designing a proper generative model to "memorize" the whole corpus for document retrieval remains a challenge. Inspired by learning strategies, we have proposed SE-DSI to advance the original DSI. It takes as input the original document augmented with RCs containing important parts, and outputs the ED with explicit semantic meanings. The offline experimental results on several representative retrieval datasets demonstrated the effectiveness of our SE-DSI model. The online evaluation again verified the value of this work.
As DSI is a novel document retrieval paradigm, there remains considerable room to improve the performance of DSI models. In future work, we would like to focus on the following directions: (1) Scenario: the document corpus is usually dynamic in real-world search engines; (2) Architecture: there is potential in exploring other model architectures or larger autoregressive models yet to come; (3) Learning: how to define learning strategies and identifiers, etc.

Figure 2: An overview of our SE-DSI model. (a) We employ a query generation module to obtain the ED from a document as its docid. (b) In the indexing phase, we propose to pair the original document and Rehearsal Contents (i.e., passage-level and sentence-level information) with the corresponding docid, respectively. In the retrieval phase, the docids are generated from the query, and a ranked list of potentially relevant documents is returned via beam search.

Table 7 :
For the same document (D3240834) in Table 4, EC-passage and EC-sentence are key passages and sentences of the document. Given the query, the SE-DSI variant that encodes RCs and the SE-DSI variant that encodes the original documents return the top-5 beams. Correct results are marked with asterisks.
EC-passage: Disney's Theme Parks had an operating cost of 571 million dollars divided by their 11 parks and being open 365 days a year, on average their operating cost per day. . .
EC-sentence: How much does it cost Disney to run Disneyland per day including California Adventure Disney?
Query: How much is a cost to run Disneyland?
# | SE-DSI (RCs) | SE-DSI (original documents)
1 | *Average cost of Disneyland* | Cost of Disneyland tickets
2 | Cost of Disneyland tickets | Admission rate for Disneyland
3 | Cost of locker at Disneyland | Disney ticket price
4 | Disney ticket price | *Average cost of Disneyland*
5 | Admission rate for Disneyland | Cost of locker at Disneyland

Table 1 :
Datasets. #Doc denotes the number of documents. #Train denotes the number of query-document pairs in the training set. #Dev denotes the number of queries in the dev set. The dev set is used for evaluation.
The dataset statistics are shown in Table 1. We use the original validation sets of MS MARCO and NQ for evaluation following [19,34,35,52,53], since both the MS MARCO and NQ leaderboards limit the frequency of submission.

Table 4 :
An example from the MS MARCO 100K dev set. Given a query (QID:320792), which is relevant to D3240834, SE-DSI and DSI-SEM return the top-5 beams. Correct results are marked bold.
In 2015, Disney earned US$16,162 billion... the operating cost of a single theme park is likely to be... it spends a lot on a daily basis, that could easily be 15-20% ...
MARCO 100K dev set, and show the top-5 retrieval results by SE-DSI and DSI-SEM, which use EDs and semantic numeric docids, respectively. As shown in Table *

Table 8 :
Online A/B experimental results under the automatic evaluation. All the values are statistically significant (t-test with p < 0.05).

Table 9 :
An example of official site retrieval. Given the user query, DualEnc, SingleTow and SE-DSI return the top-3 results. Correct results are marked bold.

Table 10 :
Human evaluation results in terms of ΔGSB. All the values are statistically significant (t-test with p < 0.05).