T2Ranking: A large-scale Chinese Benchmark for Passage Ranking

Passage ranking involves two stages: passage retrieval and passage re-ranking, which are important and challenging topics for both academia and industry in the area of Information Retrieval (IR). However, the commonly-used datasets for passage ranking usually focus on the English language. For non-English scenarios, such as Chinese, the existing datasets are limited in terms of data scale, fine-grained relevance annotation and false negative issues. To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators are recruited to provide 4-level graded relevance scores (fine-grained) for query-passage pairs instead of binary relevance judgments (coarse-grained). To ease the false negative issue, more passages with higher diversity are considered when performing relevance annotations, especially in the test set, to ensure a more accurate evaluation. Apart from the textual query and passage data, other auxiliary resources are also provided, such as query types and the XML files of the documents from which passages are generated, to facilitate further studies. To evaluate the dataset, commonly used ranking models are implemented and tested on T2Ranking as baselines. The experimental results show that T2Ranking is challenging and there is still scope for improvement. The full data and all code are available at https://github.com/THUIR/T2Ranking/


INTRODUCTION
Passage ranking is a crucial component of information retrieval systems. The promising performance of passage ranking leads to the satisfaction of search users and benefits multiple IR-related applications, e.g., question answering [1] and reading comprehension [17]. Typically, passage ranking encapsulates two coherent stages, i.e., passage retrieval and passage re-ranking. The goal of passage ranking is to compile a search result list ordered in terms of relevance. However, in existing datasets, typically only a small set of annotated passages is marked as positive while all unlabeled passages are regarded as negative, which introduces false negatives. A recent dataset, i.e., DuReader_retrieval, attempts to ease this issue by asking annotators to manually check and relabel the passages in the top retrieved results pooled from multiple retrievers. In order to ensure high-quality training and evaluation of passage ranking models, we construct and release a new Chinese dataset, named T2Ranking, comprising more than 307K question-based queries and over 2.3M passages extracted from 1.8M web documents. Specifically, we sample search queries from user logs of the Sogou search engine, a popular search system in China, and perform query preprocessing, such as filtering pornographic queries and non-interrogative queries and removing similar queries, to obtain a clean and high-quality query set with 307K queries (50K for the test set). For each query, we extract the content of corresponding documents from different search engines and remove vertical results (e.g., image search results and video search results) and duplicate results for the following process. To ensure the semantic integrity of each passage, we train and use a passage segmentation model to extract passages from each document, which gives us around 1.3 passages per document. We then use a passage clustering approach to discard highly similar passages and generate the query-passage pool. Moreover, we also record query types and other auxiliary resources of documents to facilitate further studies (e.g., multimodal tasks and out-of-domain (OOD) tasks). For
a given query and its corresponding passages, we hire expert annotators to provide 4-level relevance judgments for each query-passage pair and adopt active learning-based data sampling to improve the efficiency and quality of annotation. All hired annotators are full-time staff engaged in annotation work.
We carry out in-depth analyses and present detailed statistics of the proposed dataset. Additionally, we conduct comprehensive experiments to evaluate the performance of multiple passage retrieval models as well as passage re-ranking models on T2Ranking. The experimental results show that T2Ranking is highly challenging and that there is still potential for further performance improvement.
In summary, we make the following contributions:
• We build a large-scale Chinese dataset, named T2Ranking, for passage ranking (retrieval and re-ranking). T2Ranking contains more than 300K queries and over 2M unique passages, and also comes with fine-grained relevance annotations, along with query types, document titles and XML files as multimodal information.
• We leverage multiple strategies to ensure the high quality of our dataset, such as using a passage segmentation model and a passage clustering model to enhance the semantic integrity and diversity of passages, and employing an active learning-based annotation method to improve the efficiency and quality of data annotation.
• We conduct extensive experiments to evaluate the performance of existing passage retrieval and re-ranking models on T2Ranking. Experimental results show room for further improvement, which might be brought by more sophisticated models in the future.

RELATED WORK
There are several benchmark datasets developed for passage ranking. For datasets that have relevance annotations for all query-passage pairs, both passage retrieval and passage re-ranking tasks can be tested. Other datasets, however, only focus on passage re-ranking tasks, providing relevance annotations only for query-passage pairs in which the passages have been extracted from the initial result lists recalled by first-stage retrievers. We use FR to denote the first stage of passage ranking, i.e., passage retrieval, and SR to denote the second stage of passage ranking, i.e., passage re-ranking, as shown in Table 1.
Commonly used datasets for passage ranking are constructed for the English community. TREC Complex Answer Retrieval (CAR) [6] uses topics, outlines, and paragraphs extracted from Wikipedia. For the training set, a passage is considered relevant if it is found within the Wikipedia pages of the topic and non-relevant otherwise. The test set, comprised of 113 complex topics, has 50 passages per topic that are manually annotated. TriviaQA [11] gathers question-answer pairs from 14 trivia and quiz-league websites, and passages from Wikipedia and web documents. MS-MARCO [16] is widely utilized due to its large scale. Unlike TREC CAR and TriviaQA, queries in MS-MARCO are sourced from user-generated, question-based queries submitted to the Bing search engine. The passages are extracted from realistic web documents returned by the same search engine. Human editors are then recruited and instructed to create a natural language answer for a given query, with the correct information extracted strictly from the provided passages. The relevance levels of passages in both TriviaQA and MS-MARCO are determined in a binary fashion, based on whether or not the passages contain facets of the true answer to a given query.
For the Chinese community, there exist several datasets designed for training and evaluating passage ranking models. Drawing upon the Sogou search engine, three datasets have been established, namely Sogou-SRR [29], Sogou-QCL [30] and TianGong-PDR [27]. Sogou-SRR (Search Result Relevance) consists of 6K queries and the corresponding top 10 search results. For each search result, the screenshot, title, snippet, HTML source code, parse tree and URL, as well as a 4-grade relevance score and the result type, are provided. Sogou-QCL is a large-scale dataset comprising 537K queries and more than 9 million Chinese web pages. Rather than human-generated relevance judgments, relevance levels of query-result pairs are assessed based on click labels. Queries in TianGong-PDR are collected from Sogou's search logs, while passages are obtained from web pages of the Sina news website. Moreover, 4-grade human-assessed relevance labels for each query-passage pair are available. Besides, mMarco-Chinese [3] is constructed via machine translation from MS-MARCO. However, these datasets are either not large-scale or lack human-generated annotations. Recently, Qiu et al. [20] proposed a new dataset, named DuReader_retrieval, for benchmarking passage retrieval models with data from Baidu search. Similar to MS-MARCO, queries in DuReader_retrieval are question-based, and human-generated answers are collected to assess the relevance levels of passages. Long et al.
[15] build Multi-CPR, a multi-domain dataset for passage ranking. Queries and passages for Multi-CPR are gathered from three different vertical search systems: E-commerce, Entertainment Video, and Medical. Rather than being extracted from web documents, passages in Multi-CPR are the titles of search results, such as product titles in E-commerce search, resulting in shorter passage lengths. Human annotators were recruited to judge the (binary) relevance of the query-passage pairs. For each query, the most semantically relevant passage is marked as positive, while the others are marked as negative.

TASK DEFINITION
In this section, we formally define the tasks in T2Ranking. Our proposed dataset focuses on two stages of passage ranking, namely, passage retrieval and re-ranking. This aligns with the pipeline of modern information retrieval systems, which follow the retrieval-then-re-ranking paradigm.
The goal of passage retrieval is to retrieve candidate passages in response to a given query. Given a query q, a retrieval model is used to retrieve a candidate set of k passages K = {p_i}_{i=1}^{k} from a large corpus G = {p_j}_{j=1}^{N}, where k ≪ N. The main challenge in passage retrieval lies in efficiently retrieving the relevant passages for a query, given the vast number of passages in the corpus. Following retrieval, re-ranking is performed to derive a permutation over K, such that the more relevant passages are ranked higher in the list. In contrast to the retrieval task, the re-ranking task demands that models have a strong capability for relevance modeling, capturing subtle semantic differences between relevant passages in the candidate set K.
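As a toy illustration of this two-stage paradigm (the embeddings, scores, and sizes below are made-up placeholders for the example, not the paper's models):

```python
import numpy as np

def retrieve(query_emb, corpus_embs, k):
    """First stage: return indices of the top-k passages by dot-product score."""
    scores = corpus_embs @ query_emb
    return np.argsort(-scores)[:k]

def rerank(candidates, rel_scores):
    """Second stage: permute the candidate set by a (stronger) relevance model."""
    return [p for _, p in sorted(zip(rel_scores, candidates), key=lambda t: -t[0])]

# Toy corpus of N=4 passage embeddings; retrieve k=2 candidates, then re-rank
# them with hypothetical cross-encoder scores.
corpus = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.5, 0.5]])
q = np.array([1.0, 0.1])
cand = retrieve(q, corpus, k=2)
ranked = rerank(list(cand), rel_scores=[0.3, 0.9])
```

The first stage must be cheap enough to scan the whole corpus; the second stage can afford a more expensive relevance model because it only sees k ≪ N candidates.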

DATASET CONSTRUCTION
In this section, we present the construction details of T2Ranking. We begin by introducing the overall pipeline of dataset construction, which includes query sampling, passage extraction, and relevance annotation. We then provide important technical details used in the data construction, such as model-based passage segmentation, clustering-based passage de-duplication, and active learning-based data sampling.

Overall Pipeline
The overall pipeline of constructing T2Ranking involves several steps, including query sampling, document retrieval, passage extraction and fine-grained relevance annotation.
Query sampling. We sample real user queries from the query pool of Sogou and perform pre-processing (e.g., de-duplication and normalization of redundant spaces and question marks) to obtain a clean query dataset. Then, we filter out pornographic, non-interrogative and resource-request-type queries, as well as queries that might include user information, using an intent analysis algorithm, to ensure that the dataset consists only of high-quality, question-based queries. Note that resource-request-type queries are used to search for specific music, film resources, etc.
Document retrieval. We retrieve a comprehensive set of documents for each query from popular search engines such as Sogou, Baidu, and Google, taking advantage of their vast resources and expertise in indexing and ranking web content. This helps to reduce the issue of false negatives, as each system covers different parts of the web and can return different relevant documents, hence improving the overall coverage of our dataset.
Passage extraction. The construction of passages in T2Ranking involves segmentation and de-duplication. Rather than using a heuristic approach to segment passages from a given document (e.g., conventionally determining the start or end of passages by line breaks), we employ a model-based method for passage segmentation to maximize the preservation of complete semantics in each passage (detailed in Section 4.2). Additionally, we introduce a clustering-based technique to enhance the efficiency of annotation and maintain the diversity of the annotated query-passage pairs (detailed in Section 4.3). This approach effectively removes nearly identical passages retrieved for a particular query. The resulting segmented and de-duplicated passages are subsequently merged into the passage collection of T2Ranking.
Fine-grained relevance annotation. All hired annotators are experts in providing annotations for search-related tasks and have engaged in labeling work for a long time. At least three annotators provide 4-level fine-grained annotations for each query-passage pair. Specifically, if the annotations are inconsistent among the first three annotators for a particular pair (i.e., the three annotators provide three different scores), a fourth annotator is asked to assess it. In cases where all four annotators are inconsistent, the query-passage pair is considered too ambiguous to determine the required information and is excluded from the dataset. The final relevance label for each query-passage pair is determined by majority voting. Following the criteria of TREC benchmarks [5], we define the instructions for 4-level relevance annotation as:
• Level-0. There is a complete mismatch between the content of the query and the passage.
• Level-1. The passage is relevant to the query, but it does not meet the information needs of the query.
• Level-2. The passage is relevant to the query and partly satisfies its information needs.
• Level-3. The passage content is customized to satisfy the information needs of the query and precisely contains the answer to the query.
We show several examples in Table 2. The fine-grained 4-level annotation enables accurate evaluation of passage re-ranking tasks. Notably, for the retrieval task, we consider Level-2 and Level-3 passages as relevant passages, and all other passages are regarded as irrelevant.
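The majority-vote aggregation and Level-2/3 binarization described above can be sketched as follows (function names are illustrative, not from the paper's codebase):

```python
from collections import Counter

def final_label(scores):
    """Majority vote over per-annotator 4-level scores (0-3).

    Mirrors the paper's protocol: with three annotators, a 2-of-3 agreement
    decides the label; a pair where all collected scores differ would be
    escalated to a fourth annotator (modeled here by returning None).
    """
    label, count = Counter(scores).most_common(1)[0]
    return label if count >= 2 else None

def is_relevant(label):
    """Binarization used for the retrieval task: Level-2/3 count as relevant."""
    return label is not None and label >= 2

assert final_label([3, 3, 2]) == 3       # 2-of-3 agreement
assert final_label([0, 1, 2]) is None    # no majority: escalate / discard
assert is_relevant(2) and not is_relevant(1)
```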
Notably, when processing the test queries, we annotate all query-passage pairs after the passage segmentation process, which mitigates the problem of false negatives in our test set and hence provides a more precise evaluation of the retrieval and re-ranking performance. For the training queries, we employ the aforementioned clustering-based method to de-duplicate the passages, which are then presented to the recruited expert annotators to obtain 4-level fine-grained annotations. This strategy not only enhances the efficiency of annotation but also maintains diversity in the annotated query-passage pairs. Besides, the success of active learning motivates us to enrich the information involved in our training samples by choosing informative query-passage pairs for annotation. The key idea behind active learning is to let the model select the training samples that are most valuable for improving its performance, leading to more efficient annotation. In T2Ranking, we design an active learning-based method to annotate the training data in an iterative manner (detailed in Section 4.4). Overall, the data construction pipeline of T2Ranking is formally defined in Alg. 1.

Model-based Passage Segmentation
Typically, in existing datasets, passages are segmented from documents according to natural paragraphs or a sliding window with a fixed length. However, natural paragraph-based segmentation usually results in excessively long passages containing multiple topics, considering that most web documents are not well-written. Besides, sliding window-based segmentation often leads to a lack of complete semantics in a passage [4, 20], thereby reducing the reliability of the dataset for the evaluation of passage retrieval and re-ranking.
To address this issue, we propose a model-based method for passage segmentation. A segmentation model is trained on well-written web documents using a sequence labeling task. Specifically, we use Sogou Baike, Baidu Baike and Chinese Wikipedia as the training data, given that these web documents are generally well-written and their natural paragraphs are clearly defined. An example from the English version of Wikipedia is shown in Figure 1. Given a web document d = {w_i}_{i=1}^{|d|}, we utilize a segmentation model F(·) to determine whether a given word w_i should end a passage:

ŷ_i = F(w_i | d),  (1)

where ŷ_i is the predicted score for segmentation. The true label y_i represents whether the word w_i is the last word of a paragraph. The segmentation model F(·) is trained based on the binary cross-entropy loss defined in Eq. 2:

L_seg = -Σ_{i=1}^{|d|} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ].  (2)

If ŷ_i ≥ τ, then a passage boundary is placed after the i-th word. τ is a hyperparameter that controls the degree of segmentation: the smaller the value of τ, the more passages are segmented.
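Given per-word boundary scores from such a model, the thresholded segmentation step can be sketched as follows (the scores and threshold here are placeholders; only the cutting logic matches the description above):

```python
def segment(words, boundary_probs, tau=0.5):
    """Split a document into passages wherever the model's predicted
    end-of-paragraph probability reaches the threshold tau.

    A smaller tau yields more (and shorter) passages, as noted in the paper.
    """
    passages, current = [], []
    for word, p in zip(words, boundary_probs):
        current.append(word)
        if p >= tau:                 # model says this word ends a paragraph
            passages.append(current)
            current = []
    if current:                      # flush a trailing unsegmented span
        passages.append(current)
    return passages

doc = ["a", "b", "c", "d", "e"]
probs = [0.1, 0.9, 0.2, 0.8, 0.1]    # hypothetical model outputs
assert segment(doc, probs) == [["a", "b"], ["c", "d"], ["e"]]
```

Lowering `tau` (e.g. to 0.15) would also cut after "a" and "e"'s neighbors, illustrating how the hyperparameter trades passage length against count.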

Clustering-based Passage De-duplication
Annotating a large number of highly similar passages from the web would be redundant and meaningless. In this paper, we propose a clustering-based method for passage de-duplication, which leads to more efficient annotation. Specifically, we employ a hierarchical clustering algorithm, Ward [26], to cluster similar passages together in an unsupervised manner. The passages in the same cluster are considered near duplicates. Consequently, we select only one passage from each cluster for annotation. It is worth noting that we only conduct the de-duplication on the training set. For the queries in the test set, we annotate all the passages obtained from the passage segmentation model to alleviate the false negative issue as much as possible. Intuitively, passages that are nearly identical under a specific query provide little information gain to a ranking model compared to passages with significant differences. Practically, false negatives within the same cluster as an annotated true positive can be easily filtered by a cross-encoder [21]. Therefore, we employ the clustering-based de-duplication only on the training set.
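A minimal sketch of the clustering-based de-duplication, using SciPy's Ward linkage on placeholder passage embeddings (the distance threshold and the "keep the first member" rule are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def deduplicate(passage_embs, distance_threshold):
    """Cluster passage embeddings with Ward linkage and keep one
    representative (the first member) per cluster for annotation."""
    Z = linkage(passage_embs, method="ward")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    keep, seen = [], set()
    for idx, cluster_id in enumerate(labels):
        if cluster_id not in seen:
            seen.add(cluster_id)
            keep.append(idx)
    return keep

# Two near-duplicate passages and one distinct one -> two survivors.
embs = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 5.0]])
survivors = deduplicate(embs, distance_threshold=1.0)
assert len(survivors) == 2
```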

Active Learning-based Data Sampling
In practice, we observe that not all training samples can further enhance the ranking model's performance. Training samples that a model can already predict accurately are unlikely to provide useful information for model training.
To address this issue, we draw on active learning [22], using a model to choose more informative training samples for further annotation. Active learning is a framework that enables models to participate in the data annotation process. The aim of active learning is to minimize the amount of annotated data required while maintaining or improving model performance. Formally, active learning is an iterative process in which the model makes predictions on a pool of unannotated samples. The samples with the highest uncertainty or informativeness are selected for annotation, and the annotated samples are added to the training data. The model is then updated with the newly annotated samples, and the process repeats. The framework of active learning is illustrated in Figure 2. Concretely, a query-passage re-ranking model, specifically a cross-encoder, is trained using data constructed in the initial stage. In the second stage, unannotated query-passage pairs are obtained and scored for relevance by the cross-encoder. Pairs with high confidence scores are filtered out, as they do not provide significant information for further performance improvement, while pairs with low confidence scores, which are typically considered noise samples, are also eliminated. The remaining pairs are submitted to annotators for fine-grained annotation. The annotated query-passage pairs are then added to the training set, and the cross-encoder is updated with the newly acquired samples.
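The confidence-band filtering at the heart of this sampling loop can be sketched as follows (the `low`/`high` thresholds are illustrative assumptions, not values from the paper):

```python
def select_for_annotation(pairs, scores, low=0.2, high=0.8):
    """Keep query-passage pairs whose cross-encoder confidence falls in a
    mid band: high-confidence pairs are uninformative for further training,
    and very low-confidence pairs are treated as noise and dropped too."""
    return [pair for pair, s in zip(pairs, scores) if low <= s <= high]

pairs = [("q1", "p1"), ("q1", "p2"), ("q2", "p3"), ("q2", "p4")]
scores = [0.95, 0.50, 0.05, 0.70]    # hypothetical cross-encoder confidences
selected = select_for_annotation(pairs, scores)
assert selected == [("q1", "p2"), ("q2", "p4")]
```

In the iterative loop, `selected` would be sent to annotators, merged into the training set, and the cross-encoder retrained before the next round of scoring.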

DATA STATISTICS
This section presents the data statistics of T2Ranking.
Query. Table 3 provides a summary of the statistics of queries in T2Ranking. The maximum and mean lengths of queries in the training and test sets are nearly identical. We further analyze the domain distribution of queries in the training and test sets, as demonstrated in Figure 3. Domain tags are provided by the Sogou search engine.
The query domain distribution in the training and test sets is consistent, and the queries cover a broad range of domains. We also measure the diversity of queries using the intra-list similarity (ILS) metric [31], which can be defined as the average pairwise similarity between query embeddings:

ILS(Q) = 2 / (|Q| · (|Q| - 1)) · Σ_{q_i, q_j ∈ Q, i<j} cos(BERT(q_i), BERT(q_j)),

where BERT [13] is a pre-trained language model that is often used as the backbone model for various tasks [7-9, 12, 18]. A lower ILS score indicates lower similarity between queries in the benchmark, and thus a higher level of diversity. We calculated the ILS scores of T2Ranking as well as those of several popular datasets.
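A minimal sketch of the ILS computation as average pairwise cosine similarity (any query embeddings can be plugged in; the paper uses BERT embeddings, while the vectors below are toy placeholders):

```python
import numpy as np

def ils(query_embs):
    """Intra-list similarity: mean pairwise cosine similarity among query
    embeddings. Lower ILS means more diverse queries."""
    X = np.asarray(query_embs, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    n = len(X)
    sims = [X[i] @ X[j] for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

# Orthogonal queries -> ILS 0 (maximally diverse); parallel queries -> ILS 1.
assert abs(ils([[1, 0], [0, 1]])) < 1e-9
assert abs(ils([[1, 0], [2, 0]]) - 1.0) < 1e-9
```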

EXPERIMENTS AND RESULTS
Consistent with modern information retrieval systems, the retrieval-then-re-ranking paradigm is adopted in our experiments. In this section, we examine the performance of commonly used retrievers and re-rankers on T2Ranking.

Retrieval Performance
Baselines. Existing retrieval models can be broadly divided into sparse retrieval models and dense retrieval models. Sparse retrieval models focus on exact matching signals to design a relevance scoring function, with BM25 being the most prominent and widely utilized baseline due to its promising performance. Dense retrieval models, in contrast, leverage deep neural networks to learn low-dimensional dense embeddings for queries and documents. Generally, most existing dense retrieval methods adhere to the cascade training paradigm [15, 20, 21]. Therefore, to facilitate easier comparison in future studies on our dataset, we simplify the training process as illustrated in Figure 5, following [15, 20]. Specifically, we utilize the dual-encoder (DE) as the architecture of dense retrieval models, as illustrated in Figure 6(a). The following methods are employed as baselines to evaluate the retrieval performance on T2Ranking.
• QL (query likelihood) [19] is a representative statistical language model that measures the relevance of passages by modeling the generation of a query.
• BM25 [23] is a widely-used sparse retrieval baseline.
• DE w/ BM25 Neg is equivalent to DPR [12], the first work that uses a pre-trained language model as the backbone for the passage retrieval task.
• DE w/ Mined Neg enhances the performance of DPR by sampling hard negatives globally from the entire corpus, as in ANCE [28] and RocketQA [21].
• DPTDR [25] is the first work that employs prompt tuning for dense retrieval.
Among them, QL and BM25 are sparse retrieval models, whereas the others are dense retrieval models.
Implementation details. BM25 is implemented with Pyserini [14] using default parameters. The dual-encoder models are implemented with an off-the-shelf Chinese BERT base model as the backbone; detailed training settings are given in the caption of Table 5.
Metrics. The following evaluation metrics are used in our experiments to examine the retrieval performance of baselines on T2Ranking: (1) Mean Reciprocal Rank for the top 10 retrieved passages (MRR@10), and (2) Recall for the top-k retrieved passages (Recall@k). Notably, for the retrieval task, we consider Level-2 and Level-3 passages as relevant passages, and all other passages are regarded as irrelevant. For a comprehensive comparison, we report Recall@50 and Recall@1K on the test queries. Following the evaluation settings of MS-MARCO and DuReader_retrieval, MRR is defined as the average of the reciprocal ranks of the first relevant passage for a set of queries. MRR takes a value between 0 and 1, with a higher value indicating that the system is better at ranking the most relevant passage higher in the list. Meanwhile, Recall is defined as the fraction of relevant passages that are retrieved among all relevant passages, also with a value between 0 and 1, where a higher value indicates that the system is better at retrieving all relevant passages. MRR and Recall measure different aspects of retrieval performance. MRR@k and Recall@k can be depicted as:

MRR@k = (1/|Q|) Σ_{q ∈ Q} I(pos_q ≤ k) / pos_q,

Recall@k = (1/|Q|) Σ_{q ∈ Q} |{p ∈ P_q : rank(p, K_q) ≤ k}| / |P_q|,

where I(·) is an indicator function, pos_q denotes the position of the first relevant passage in the retrieved candidates of query q, P_q represents the set of relevant passages of query q, and rank(p, K_q) denotes the position of passage p in the candidate list K_q.
Retrieval performance. We report the retrieval performance of baselines in Table 5. Compared to the traditional sparse retrieval method BM25, dual-encoder models significantly boost the retrieval performance on our dataset. The improvement can be attributed to the integration of two distinct sources of knowledge, i.e., latent knowledge obtained through unsupervised pre-training of language models on a massive corpus, and relevance knowledge acquired through supervised training on our large-scale annotated dataset. Equipped with the negative mining strategy proposed in recent studies [28], the retrieval performance of dual-encoder models can be further improved on T2Ranking. It is worth noting that the Recall@k values observed on T2Ranking are lower than those reported on other benchmarks with coarse-grained annotations. For instance, the Recall@50 of BM25 is 0.601 and 0.700 on MS-MARCO-DEV Passage and DuReader_retrieval, respectively, but 0.4918 on our dataset. In the test set of T2Ranking, a greater number of passages is annotated with fine-grained relevance labels, leading to an average of 4.74 positive passages per query, which makes the retrieval task more difficult and eases the false negative problem to some extent. This highlights the challenging nature of T2Ranking and the potential for further improvement in the future.
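The MRR@k and Recall@k definitions can be sketched as:

```python
def mrr_at_k(first_rel_positions, k):
    """first_rel_positions[q]: 1-based rank of the first relevant passage
    for query q, or None if no relevant passage is retrieved."""
    rr = [1.0 / p if p is not None and p <= k else 0.0
          for p in first_rel_positions]
    return sum(rr) / len(rr)

def recall_at_k(relevant, retrieved, k):
    """Fraction of a query's relevant passages found in its top-k list."""
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

# Three queries: first relevant passage at ranks 1, 2, and not retrieved.
assert mrr_at_k([1, 2, None], k=10) == (1.0 + 0.5 + 0.0) / 3
# One of two relevant passages appears in the top 2.
assert recall_at_k({"p1", "p2"}, ["p3", "p1", "p4"], k=2) == 0.5
```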

Re-ranking Performance
Baselines. Since re-rankers consider a much smaller number of passages, they tend to use the cross-encoder architecture rather than the dual-encoder architecture. The cross-encoder allows for more detailed interaction between queries and documents, resulting in better performance, although at the expense of lower efficiency. We report the re-ranking performance of the cross-encoder model, which is trained on the hard negatives mined from the entire corpus, as depicted in Figure 5. The architecture of the cross-encoder is illustrated in Figure 6(b).
Implementation details. The cross-encoder is implemented in the same experimental environment as the dual-encoder, with a maximum input length of 288. Negatives are sampled from the top 256 passages retrieved by the dual-encoder, and a positive-to-negative ratio of 1:128 is set. The cross-encoder is then trained for 5 epochs with a learning rate of 3e-5.
Metrics. To evaluate the re-ranking performance of the cross-encoder, we use two ranking metrics: MRR@10 and nDCG@k. In the test set of T2Ranking, the average number of annotated passages per query is 15.7, with a maximum of 100 annotated passages. We report nDCG@20 and nDCG@100 on the test queries. nDCG@k normalizes DCG@k by dividing it by iDCG@k, the DCG@k of the ideal ordering of the passages. DCG@k discounts the graded relevance value of a passage according to the rank at which it appears:

DCG@k = Σ_{i=1}^{k} (2^{rel(i)} - 1) / log2(i + 1),   nDCG@k = DCG@k / iDCG@k,

where rel(i) is the graded relevance of the passage at rank i.
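A minimal sketch of DCG@k/nDCG@k with the standard exponential-gain formulation (the exact gain function used in the paper's evaluation is an assumption here):

```python
import math

def dcg_at_k(rels, k):
    """DCG with graded relevance rel(i) discounted by log2(rank + 1);
    enumerate() is 0-based, hence the log2(i + 2)."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """Normalize by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered list of 4-level labels scores exactly 1.0;
# swapping items can only lower the score.
assert ndcg_at_k([3, 2, 1, 0], k=4) == 1.0
assert 0.0 < ndcg_at_k([0, 3, 2, 1], k=4) < 1.0
```

The exponential gain 2^rel - 1 rewards placing Level-3 passages ahead of Level-2 ones much more strongly than a linear gain would, which is what makes nDCG a natural fit for 4-level annotations.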
Re-ranking performance. The re-ranking performance of the cross-encoder is shown in Table 6. The results indicate that re-ranking the candidates retrieved by the dual-encoder significantly outperforms re-ranking the candidates retrieved by BM25. The improved performance is attributed to the higher recall rate achieved by the dual-encoder, which is consistent with previous studies conducted on other benchmarks [15, 20]. The re-ranking performance on T2Ranking, however, is lower compared to other benchmarks [15, 20]. This can be explained by the presence of more fine-grained annotated relevant passages and more diverse queries in T2Ranking, which makes it a more challenging benchmark but also provides a more accurate reflection of re-ranking performance.

CONCLUSION
In this study, we introduce T2Ranking, a large-scale benchmark for Chinese passage ranking that involves both retrieval and re-ranking tasks. To construct a high-quality dataset, we leverage various strategies, including model-based passage segmentation, clustering-based passage de-duplication and active learning-based data sampling. Specifically, we adopt a model-based method for passage segmentation in T2Ranking, which aims to maximize the preservation of complete semantics in each passage. To balance the efficiency of annotation with the diversity of annotated query-passage pairs, we incorporate a clustering-based technique in T2Ranking to remove highly similar passages retrieved for a specific query, which helps streamline the annotation process without compromising the overall quality of the dataset. The adoption of an active learning strategy in the construction of T2Ranking enhances the efficiency of annotating more informative training samples. The active learning framework enables the dataset to be continuously updated with the most valuable samples while minimizing the number of annotations required to achieve optimal performance. Furthermore, to ensure high-quality annotation, expert annotators implement a 4-level fine-grained annotation scheme for both the training and test sets of T2Ranking. This scheme allows for more nuanced modeling of IR models during training and a more precise evaluation of the models during testing. In summary, T2Ranking encompasses over 300K queries and more than 2M unique passages, with around 2.4 million query-passage pairs annotated with fine-grained relevance labels by expert annotators.
To the best of our knowledge, T 2 Ranking is the largest Chinese benchmark with fine-grained annotation for passage ranking.We believe that this dataset will make a significant contribution to the IR community and the advancement of IR technology.

As defined in Section 3, the candidate set K = {p_i}_{i=1}^{k} is retrieved from a large corpus G = {p_j}_{j=1}^{N}, where k ≪ N. In particular, a passage consists of a sequence of words p = {w_i}_{i=1}^{|p|}, where |p| represents the length of passage p. Similarly, a query is a sequence of words q = {w_i}_{i=1}^{|q|}.

Figure 1: Illustration of a web document from Wikipedia, which is well-written with clearly defined paragraphs.

Figure 2: Illustration of the framework of active learning.

Figure 3: Domain statistics for the training and test queries in T2Ranking.

Figure 4: Pie chart of the annotation distribution.

Figure 5: Illustration of the training process of the baselines used in our experiments. First, we train a dual-encoder with BM25 negatives, which is similar to DPR [12]. Second, we train the dual-encoder and cross-encoder with the global negative sampling strategy proposed in several studies [15, 20].

Figure 6: Illustration of the architectures of the dual-encoder and cross-encoder.

Table 2: Examples of annotations for query-passage pairs.
Algorithm 1: The pipeline of dataset construction.
Input: Query pool Q, document pool D, passage segmentation model F(·), cross-encoder E(·) and expert annotators H
Output: Fine-grained relevance labels L
DatasetConstruction(Q, D, H)
begin
    Q = Sample(Q);  % sample a set of queries Q
    L = ∅, P = ∅;  % initialize a label set and a passage set
    for q ∈ Q do
        % retrieve a set of documents D for query q via multiple search engines
        D = MultiSearchEngines(q, D);
        for d ∈ D do
            P = P ∪ F(d);  % passage segmentation
        end
        for p ∈ P do
            L = L ∪ H(q, p);  % fine-grained expert annotation
        end
    end
    return L
end

Table 3: Statistics of queries in T2Ranking.
The ILS scores of T2Ranking, as well as those of several popular datasets such as MS-MARCO, Multi-CPR and DuReader_retrieval, are shown in Table 4. From the table, it is evident that the queries in T2Ranking are more diverse, as indicated by a lower ILS score. We display the distribution of the 4-level relevance annotations in Figure 4. In the training set, each query is annotated with 6.25 passages on average, while in the test set, each query is annotated with an average of 15.75 passages.

Table 4: ILS scores of different datasets. Lower ILS scores indicate higher query diversity.

Table 5: Performance of retrieval models on the test set of T2Ranking. We use the off-the-shelf Chinese BERT base model to initialize the dual-encoder. The maximum lengths of queries and passages are set to 32 and 256, respectively. Negatives are sampled from the top 200 passages recalled by BM25 or DE w/ BM25 Neg. The positive-to-negative ratio is set to 1:1. We train the dual-encoder for 100 epochs with a learning rate of 3e-5.

Table 6: Performance of the cross-encoder with mined negatives on the test set of T2Ranking.