NIR-Prompt: A Multi-task Generalized Neural Information Retrieval Training Framework

Information retrieval aims to find information that meets users' needs from a corpus. Different needs correspond to different IR tasks such as document retrieval, open-domain question answering, and retrieval-based dialogue, yet these tasks share the same schema of estimating the relationship between texts. This suggests that a good IR model should generalize to different tasks and domains. However, previous studies indicate that state-of-the-art neural information retrieval (NIR) models, e.g., pre-trained language models (PLMs), are hard to generalize, mainly because the end-to-end fine-tuning paradigm makes the model overemphasize task-specific signals and domain biases but lose the ability to capture generalized essential signals. To address this problem, we propose a novel NIR training framework named NIR-Prompt for the retrieval and reranking stages, based on the idea of decoupling signal capturing and signal combination. NIR-Prompt exploits an Essential Matching Module (EMM) to capture the essential matching signals and obtains descriptions of tasks through a Matching Description Module (MDM). The description is used as task-adaptation information to combine the essential matching signals and adapt to different tasks. Experiments under in-domain multi-task, out-of-domain multi-task, and new task adaptation settings show that NIR-Prompt improves the generalization of PLMs in NIR for both the retrieval and reranking stages compared with baselines.


INTRODUCTION
Information retrieval (IR) is the fundamental task of finding target documents from large-scale resources to meet users' information needs, and it has been applied to many downstream tasks such as document retrieval (DR) [46], open-domain question answering (QA) [9, 90], and retrieval-based dialogue (RD) [83]. Traditional information retrieval methods such as BM25 exploit word-to-word exact matching to score the relevance between texts and rank them, but these methods fail to capture semantics. Recently, deep neural networks have been introduced into information retrieval, such as dense retrieval for the retrieval stage [28] and cross-attention for the reranking stage [10]. They can capture the semantic relationship between texts and achieve significant improvements when there is enough in-domain training data. However, previous studies have shown that the multi-task generalization ability of neural information retrieval (NIR) models is poor [61, 68], even worse than traditional word-to-word exact matching methods. In general, most NIR models can only achieve good performance when trained and evaluated on a single specific dataset, while having poor generalization ability to different domains and tasks [15, 68].
To address the aforementioned issue in neural information retrieval models, we begin by defining and evaluating multi-task generalization. In this study, the multi-task generalization of a NIR model is categorized into three levels: 1) In-domain multi-task: The model has access to various datasets encompassing multiple IR tasks such as QA, RD, and DR. Evaluation is performed on each provided dataset. This level assesses the model's capability to learn from diverse IR tasks and apply that knowledge effectively within the respective domain of each task. 2) Out-of-domain multi-task: The model can access the same datasets as in the in-domain multi-task level. However, it is evaluated on unseen datasets from different domains. This level examines the model's ability to transfer learned knowledge to new domains, for instance, transitioning from financial QA to medical QA.
3) New task adaptation: This level involves making a task invisible to the model during training, such as QA, while training the model on other tasks like RD and DR. The model is provided with few-shot examples from the unseen task to adapt and generalize to this new task. This level particularly focuses on assessing the model's capability for new task generalization, which is the most challenging level for IR models.
Then, we point out that the core sub-task of state-of-the-art (i.e., PLM-based) NIR models is text matching. Text matching aims to estimate the relevance score between two texts for a specific task, so generalization in neural information retrieval can be converted to the generalization of text matching. The foundation for NIR models to estimate the relevance score between texts is the matching signal. Although there are certain differences in the matching signals for different retrieval tasks, some matching signals are shared across tasks, and we call them essential matching signals. Since paraphrase identification (PI) [14] and natural language inference (NLI) [6] are also two mainstream text matching tasks (though not IR tasks) and can also provide shared essential matching signals, we incorporate them into the analysis of matching signals as well. We consider the three most common and unambiguous essential matching signals across tasks. Exact matching signals measure word overlap between texts, such as query term importance in DR [19, 49], lexical overlap features in PI [12], and bigram counts in QA [9]. Semantic matching signals measure the semantic association between texts beyond mere word overlap. Many studies prove that introducing semantic information into DR [65] and PI [48] can improve performance, especially for QA, where dense retrieval [86] outperforms lexical retrieval when trained on sufficient in-domain data, e.g., DPR [28] and ORQA [33]. Inference matching signals capture implication relationships. NLI, QA, and RD all have to infer the implication relation between the two texts. For example, RD needs to confirm that the reply is a logical consequence of the post and dialogue history, and QA needs to pick out an answer that the question hardly contains. From the application perspective, the difference between text matching tasks lies in the different fusion and utilization of these matching signals. However, the current mainstream end-to-end fine-tuning paradigm makes the model overemphasize the task-specific 
matching signals and domain biases but lose the ability to capture the essential matching signals that can be used across different matching tasks and domains for IR, which reduces generalization to different tasks and domains. Thus, if a matching model can capture the essential matching signals shared across tasks and combine them according to a specific matching task, its multi-task generalization ability will be improved. However, the main challenges in obtaining a well-generalized text matching model are twofold: 1) how to capture the essential matching signals, and 2) how to exploit these signals to apply the model to different matching tasks. Typical IR systems mainly involve two stages: the first is retrieval, which aims to recall a subset of candidate documents from a huge amount of resources, and the second is reranking, which ranks the retrieved subset more finely. In this paper, we propose a generalized NIR model training framework called NIR-Prompt for both retrieval and reranking in IR. NIR-Prompt captures and exploits essential matching signals based on the idea of decoupling the process of signal capturing and signal combination via prompt learning. Specifically, NIR-Prompt transforms text matching tasks into the form of [MASK] prediction by adding a constructed prompt template to the input texts, which is more in line with the pre-training tasks of PLMs. For retrieval, the prompt tokens instruct the PLM to output the token embedding at [MASK] to map queries and documents to a latent semantic space, and the relevance score is then calculated based on embedding similarities. For reranking, the two texts are concatenated with the constructed prompt template, and the relevance score between the texts is estimated by predicting the probability distribution of the output word at [MASK]. NIR-Prompt consists of an Essential Matching Module (EMM) and a Matching Description Module (MDM). MDM maps the description of different matching tasks to a few prompt tokens 
through prompt engineering. EMM is trained on mixed datasets consisting of different text matching tasks, and the prompt tokens from MDM are used as task-adaptation tokens to guide the learning of essential matching signals in EMM and adapt them to different tasks. Diversifying the text matching tasks helps the model explore the essential matching signals instead of overemphasizing data sample biases and task-specific signals, so that it captures the information shared across tasks; this addresses the first challenge. The prompt tokens of each task are added to the inputs of the corresponding task to distinguish different tasks and adapt the essential matching signals to the specific task, which answers the second challenge. Besides, we design a simple but effective method to construct discriminative tokens for new tasks by combining the learned prompt tokens, which to some extent indicates the correlation between different matching tasks.
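To make the idea of building new-task tokens from learned ones concrete, here is a minimal sketch. The fixed-weight averaging scheme, the toy task names, and the `combine_task_prompts` helper are all illustrative assumptions; the paper derives the combination from the relations between matching tasks rather than from hand-set weights.

```python
import numpy as np

def combine_task_prompts(task_prompts, weights):
    """Build prompt-token embeddings for a new task as a weighted
    combination of the prompt tokens learned for known tasks.
    task_prompts: dict of name -> (n_tokens, dim) embedding matrices."""
    names = list(task_prompts)
    w = np.asarray([weights[n] for n in names], dtype=float)
    w = w / w.sum()                                       # normalize weights
    stacked = np.stack([task_prompts[n] for n in names])  # (tasks, tokens, dim)
    return np.tensordot(w, stacked, axes=1)               # (tokens, dim)

# toy "learned" prompt embeddings for two seen tasks
rd = np.full((6, 4), 1.0)   # retrieval-based dialogue
dr = np.full((6, 4), 3.0)   # document retrieval
new_qa = combine_task_prompts({"RD": rd, "DR": dr}, {"RD": 0.5, "DR": 0.5})
```

An equal-weight mix of the two toy prompt matrices yields the element-wise midpoint, standing in for a new-task description that interpolates between related seen tasks.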
Experimental results on eighteen public datasets and BEIR (the heterogeneous benchmark for testing the generalization ability of retrieval models) [68] show that our method yields better in-domain multi-task, out-of-domain multi-task, and new task adaptation performance for the retrieval stage, the reranking stage, and the entire information retrieval pipeline compared to the traditional fine-tuning paradigm. The results also indicate that NIR-Prompt has a stronger ability to distinguish tasks and utilize essential matching signals shared by multiple tasks. Both are beneficial for improving multi-task generalization ability.
To sum up, our contributions are as follows:
• We propose that various text matching tasks have shared matching signals that can be used across different matching tasks and domains for information retrieval. These signals are essential for generalization in neural information retrieval models.
• We propose a novel framework named NIR-Prompt to implicitly capture and combine the essential matching signals to improve the generalization ability of NIR models for the retrieval stage, the reranking stage, and the entire information retrieval pipeline.
• We collect eighteen datasets from diverse text matching tasks, which can serve as a benchmark for the multi-task generalization ability of NIR models. We evaluate our method on these datasets in three settings: in-domain multi-task, out-of-domain multi-task, and new task adaptation. Besides, we also test the zero-shot ability of our method on BEIR to further demonstrate its positive effect on the generalization ability of NIR models. Code and datasets will be released at https://github.com/xsc1234/NIR-Prompt/tree/main/.

PRELIMINARIES ABOUT NEURAL INFORMATION RETRIEVAL
The mainstream neural information retrieval pipeline includes two stages: retrieval and reranking. In neural retrieval, dense retrieval is the most commonly used method that balances efficiency and retrieval performance, so in this paper we focus on dense retrieval for the retrieval stage. Neural reranking aims to model the relevance between texts with a more fine-grained interaction function, and the most commonly used methods are generally based on the cross encoder, such as the Transformer [69], so in this paper we focus on interaction-based models for the reranking stage. Details about them are introduced as follows.

Dense Retrieval
Dense retrieval is the first stage in the information retrieval system that efficiently and accurately obtains candidate subsets from a massive document base. A dense retrieval model is a dual-encoder structure that encodes the query and document into dense embeddings and estimates the relevance score by measuring the similarity between the two embeddings. For a text pair (q, d), the relevance score can be computed by: s = f(E_q(q), E_d(d)), where E_q and E_d are the encoders for q and d respectively, and f is the similarity function such as the inner product or Euclidean distance. Based on this, an approximate nearest neighbor search algorithm (ANN-search [25]) over the embeddings is used to search the document base efficiently.
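The dual-encoder scoring above can be sketched as follows. This is a toy illustration, not the paper's implementation: the hand-written vectors stand in for encoder outputs E_q(q) and E_d(d), the similarity f is the inner product, and the full ranking stands in for ANN search.

```python
import numpy as np

def dense_retrieve(query_emb, doc_embs, k=2):
    """Rank documents by inner-product similarity with the query embedding,
    i.e. s = f(E_q(q), E_d(d)) with f = inner product, and return the
    indices of the top-k documents plus all scores."""
    scores = doc_embs @ query_emb        # one relevance score per document
    return np.argsort(-scores)[:k], scores

q = np.array([1.0, 0.0])                 # stand-in for E_q(query)
docs = np.array([[0.9, 0.1],             # stand-ins for E_d(doc_i)
                 [0.1, 0.9],
                 [0.5, 0.5]])
top, scores = dense_retrieve(q, docs, k=1)
```

Because the two texts are encoded independently, the document matrix can be built offline and only the query is embedded at search time, which is what makes ANN search over the corpus feasible.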

Reranking
Reranking is the finer ranking stage for the candidate subsets retrieved in the first stage, which models the interaction between texts more thoroughly. Recently, with pre-trained language models (PLMs) applied to various natural language processing tasks, the mainstream structure for reranking has become the PLM-based model with concatenated text pairs. For a text pair (t1, t2), the relevance score can be computed by: s = g(h(t1 • t2)), where • is the concatenation operator, h is the interaction function such as self-attention [69], and g estimates the relevance score or category according to the interaction of the two texts.

NIR-PROMPT
In this section, we describe the core idea of NIR-Prompt and the technical details of using it to build a multi-task generalized neural information retrieval pipeline, including dense retrieval (Retriever-Prompt) and reranking (Reranker-Prompt).
The neural information retrieval pipeline of NIR-Prompt consists of two stages: Retriever-Prompt for retrieval and Reranker-Prompt for reranking. In each stage, task prompts are used as the task description to combine the essential matching signals and generalize to multiple tasks.

Basic Idea
The basic idea of NIR-Prompt is shown in Figure 2. NIR-Prompt consists of an Essential Matching Module (EMM) and a Matching Description Module (MDM). EMM captures common and essential matching signals for various text matching tasks in information retrieval, such as exact matching signals, semantic matching signals, and inference matching signals. MDM obtains the descriptions of different tasks from the PLM and uses the descriptions to guide the learning and combination of essential matching signals in EMM to adapt to different text matching tasks and domains. Specifically, in MDM, the description of each task is mapped to a few prompt tokens by prompt engineering. These tokens contain the information of each task in the PLM and can be used as differentiating marks of each task to guide the model to adapt to different tasks in multi-task learning. In EMM, the prompt tokens obtained in MDM are added to the input text of the corresponding task, and the PLM is trained on mixed datasets consisting of different tasks. Highly diverse matching tasks prevent the model from fitting the data sample bias of a specific task, so that the model can focus on learning the common and essential matching signals that can be used across domains and tasks. Besides, the prompt tokens obtained in MDM help the multi-task model better combine essential matching signals to adapt to different tasks. Both are beneficial for improving its multi-task generalization ability.

Overall Framework
The neural information retrieval pipeline consists of the retrieval and reranking stages; as described in the preliminaries, we focus on dense retrieval for the retrieval stage and interaction-based (cross-encoder) models for the reranking stage. The NIR-Prompt framework trains the neural information retrieval pipeline (dense retrieval and reranking) based on the prompt learning paradigm so as to achieve multi-task generalization of the entire IR system. Our method is called Retriever-Prompt in the retrieval stage and Reranker-Prompt in the reranking stage. The pipeline of NIR-Prompt is shown in Figure 3, and the relevant technical details are introduced below.

Matching Description Module
The matching description module (MDM) is used to generate the description (i.e., prompt tokens) for each task that reflects the knowledge in the PLM, which is then used to guide the combination of the essential matching signals in the essential matching module to generalize to different tasks. The architecture of the matching description module for dense retrieval (a) and reranking (b) is shown in Figure 4; it consists of a frozen PLM and trainable prompt encoders. The prompt encoders map the descriptive knowledge of the specific matching task in the PLM into prompt tokens and update their parameters to generate the optimal tokens during training. Different tasks have different prompt tokens and are trained separately. These tokens play an important role in the forward computation of the transformer layers, reflecting the descriptive knowledge of different tasks in the PLM [35], which helps the model describe different tasks and guides the learning and combination of essential matching signals in the essential matching module. To facilitate the introduction of these methods, we give the formal template of the input texts below. In Retriever-Prompt, for a pair of input texts (T1, T2), the templates are: P1 T1 Q1 [MASK] and P2 T2 Q2 [MASK]. The prompt tokens corresponding to each task are P1, Q1 for the input text T1 and P2, Q2 for T2. They are the descriptions of the task. Specifically, P1 (P2) defines the properties of T1 (T2), such as the query, passage, question, etc. 
Q1 (Q2) prompts the PLM to generate a suitable representation for T1 (T2) by optimizing the embedding of the token at [MASK]. There is a large gap between the dual-encoder architecture in retrieval and traditional prompt learning methods, mainly because the dual-encoder architecture requires the two texts to be encoded independently, while traditional prompt learning methods input the two texts to a pre-trained language model (PLM) jointly and prompt the PLM to give the answer (cross-encoder architecture). To solve this challenge, the training of MDM in Retriever-Prompt updates the parameters of the prompt encoders that generate these prompt tokens so as to optimize the embedding of the token at [MASK] by exploiting the knowledge in the PLM. The embedding of the token at [MASK] is used as the text representation in dense retrieval, and the optimization objective can be a function commonly used in dense retrieval, such as contrastive loss [79]. (Table 1 lists the manual prompts for document retrieval, question answering, and retrieval-based dialogue, e.g., "The query:", "The passage:", and "Representation for document retrieval is:".) The training details will be introduced in the following.
In Reranker-Prompt, different from retrieval, reranking in IR is an interaction-based process that captures the matching relationship between texts in a more fine-grained way. Some text-pair classification tasks with more complex textual interaction information based on the cross encoder, such as paraphrase identification and natural language inference, are also considered in reranking to introduce more shared matching signals. Given a pair of input texts (T1, T2), we construct an input template to prompt the PLM to judge the matching relationship between the input texts by predicting the word at the [MASK] token. Specifically, the template for (T1, T2) in reranking is: P1 T1 P2 T2 Q [MASK], where P1, P2 and Q are the prompt tokens. P1 and P2 define the properties of T1 and T2 respectively, and Q prompts the PLM to judge the matching relationship of the text pair and fill the result into [MASK]. For a classification task, a mapper is used to map the class label to the output word at [MASK]. For a relevance score estimation task, the probability of the output word at [MASK] is used as the relevance score between T1 and T2.
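The reranking template (prompt tokens P1, P2, Q wrapped around the two texts, followed by [MASK]) can be illustrated with plain strings. The specific prompt wordings and the `build_rerank_input` helper are hypothetical placeholders, not the paper's exact templates.

```python
def build_rerank_input(t1, t2,
                       p1="The question:",
                       p2="The passage:",
                       q="Do these two texts match?"):
    """Assemble the reranking input following the pattern
    P1 T1 P2 T2 Q [MASK]; the PLM then predicts the word at [MASK]."""
    return f"{p1} {t1} {p2} {t2} {q} [MASK]"

x = build_rerank_input("Who wrote Hamlet?",
                       "Hamlet is a tragedy written by William Shakespeare.")
```

In the manual-prompt variant these slots are fixed natural-language strings as above; in the continuous variant they would instead be trainable embeddings injected at the same positions.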
As for the methods of obtaining the values of the prompt tokens in the matching description module, we design three methods: manual prompt, continuous prompt, and hybrid prompt (specifically for reranking).

Manual Prompt.
Taking inspiration from LAMA [51], which utilizes manually crafted cloze templates to investigate knowledge in PLMs, the concept of manual prompts has been introduced for various NLP tasks such as text classification [59] and text generation [60]. Manual prompts consist exclusively of natural language. In Retriever-Prompt, specific manual prompts are devised for different text matching tasks, as shown in Table 1. The prompts P1 and P2 inform the PLM about the types of T1 and T2, respectively. Q1 and Q2 are sentences that ask questions about the representation of the text according to the specific task. The representations of the texts T1 and T2 are the embeddings of the tokens at [MASK]. Table 2, on the other hand, shows the distinct manual prompts for different text matching tasks in Reranker-Prompt. P1 and P2 describe the attributes of the two texts individually, while Q poses a question about the relationship between the two texts, varying depending on the task. The answer is expected to be output at [MASK], typically a word like "yes" or "no". In comparison to continuous prompts, this approach employs natural language as prompts without requiring additional training of prompt tokens, albeit at the expense of relying on human expertise, resulting in sub-optimal performance.

Continuous Prompt.
Building upon the principles of P-tuning [39], we present an approach for optimizing prompt tokens in a continuous space. In our method, prompts are represented as trainable continuous vectors. Unlike P-tuning, our focus lies in enhancing the model's multi-task generalization capabilities specifically in text matching scenarios. The objective of acquiring prompt tokens for each task is not primarily to enhance prompt learning performance, but rather to let the prompt tokens give better descriptions of each specific text matching task. (Table 2 includes manual prompts such as "Do these two texts mean the same thing?" for paraphrase identification and "Can the hypothesis be concluded from the premise?" for natural language inference.) These descriptions are used to distinguish different tasks and adapt the essential matching signals to suit the specific requirements of each task. Consequently, in Retriever-Prompt, different from P-tuning, we introduce several improvements:
Prompt encoder: After pre-training, the word embeddings in a PLM are very discrete. If the trainable prompt tokens are directly initialized from a random distribution and then optimized with stochastic gradient descent (SGD), the optimizer can only update parameters in a small neighborhood and easily falls into a local optimum rather than reaching the global optimal point [1, 39]. To address this limitation, P-tuning employs a bidirectional Long Short-Term Memory (LSTM) network to optimize all prompt token embeddings in a fully continuous space [39]. Different from it, our approach uses separate LSTMs for P1 and Q1 (P2 and Q2) respectively (as shown in Figure 4). The rationale behind this choice stems from the assumption made by LSTMs that there exists a certain dependency and sequential relationship among the prompt tokens: in a bidirectional LSTM, the tokens within P1 and Q1 (P2 and Q2) are interdependent. This interdependency unnecessarily restricts the representation of the prompt token embeddings, because (1) the tokens of P1 and Q1 can only be jointly generated by a single LSTM, and the number of trainable parameters is limited, and (2) Q1 must obey a sequential relationship with P1, which limits the value range of the embeddings of the tokens in Q1. Therefore, the representation of the continuous prompt is limited by a single LSTM. By employing separate LSTMs for P1 and Q1 (P2 and Q2), we remove the interdependency between these prompts, thus reducing the restriction on their representation.
Fixing the prompt tokens: In P-tuning and many other prompt learning methods, prompt tokens are simply added to the input text and are alterable like other tokens in self-attention. Different from this, we keep the embeddings of the trainable prompt tokens fixed in layers 1 to k of the self-attention. In these layers, the prompt tokens can affect the computation of the PLM over the input text, but the input text cannot affect the prompt tokens. In the layers after k, the prompt tokens are alterable. If we impose no restrictions on the prompt tokens, their embeddings are affected by the other input words and updated in every layer. This is not conducive to obtaining an abstract description of the specific text matching task, because there is no guarantee that the embeddings of the prompt tokens are determined only by the prompt encoder; the influence of the other input words may cause the embeddings to fit the data instead of the task. On the other hand, fixing the prompt tokens in every layer is inconsistent with the mask prediction task in the pre-training process of the PLM, which may lead to sub-optimal results when using the language model to optimize the token at [MASK]. Therefore, an intermediate boundary layer needs to be determined to balance the two. Based on our experiments, we find that the 11th layer is an optimal boundary: it balances the description of the task against the interaction with the other words in the text. In this way, the prompt tokens can achieve better control of the PLM and describe the task more comprehensively.
The parameters of the prompt encoders are updated through back-propagation, thereby adjusting the embeddings of the prompt tokens. Let v be a randomly initialized vector used as the input, and let Enc_P1, Enc_Q1, Enc_P2 and Enc_Q2 be the encoders of the corresponding prompts. Each encoder consists of a bidirectional LSTM and a multilayer perceptron. These four encoders are optimized to obtain the continuous prompts respectively. The embeddings of P1, Q1, P2 and Q2 are obtained by: e(P1) = Enc_P1(v), e(Q1) = Enc_Q1(v), e(P2) = Enc_P2(v), e(Q2) = Enc_Q2(v). The objective function of the prompt encoders is the in-batch contrastive loss, and the details of training will be introduced in Section 3.5.
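A stripped-down sketch of the separate-encoder design follows. For brevity each encoder is a single linear map rather than the paper's bidirectional LSTM + MLP (that substitution, the dimensions, and the `PromptEncoder` class are all assumptions); the point illustrated is that one independent encoder maps the shared input vector to each prompt's token embeddings.

```python
import numpy as np

class PromptEncoder:
    """Maps a shared input vector to the token embeddings of one prompt.
    Stand-in for the paper's bidirectional LSTM + MLP encoder."""
    def __init__(self, in_dim, n_tokens, emb_dim, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, n_tokens * emb_dim)) * 0.02
        self.n_tokens, self.emb_dim = n_tokens, emb_dim

    def __call__(self, v):
        # e(P) = Enc_P(v), reshaped to (n_tokens, emb_dim)
        return (v @ self.W).reshape(self.n_tokens, self.emb_dim)

v = np.ones(8)                         # randomly initialized shared input vector
# one independent encoder per prompt; token counts follow the paper's 6/6/5/5 setting
encoders = {name: PromptEncoder(8, n, 16, seed=i)
            for i, (name, n) in enumerate([("P1", 6), ("P2", 6), ("Q1", 5), ("Q2", 5)])}
prompts = {name: enc(v) for name, enc in encoders.items()}
```

Because each prompt has its own encoder, no sequential dependency ties Q1's embeddings to P1's, which is the restriction the separate-encoder choice removes.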
In Reranker-Prompt, as with the continuous prompt for retrieval, LSTMs are used to optimize P1, P2 and Q respectively, and the prompt tokens in the lower layers are fixed during the computation of the PLM. The most notable difference compared to retrieval is that the [MASK] token is not used to represent the text but to output a word such as "yes" or "no" that determines the relationship between the two texts, and the relevance score can be obtained from the probability distribution over the vocabulary. The details of training will be introduced in Section 3.5.

Hybrid Prompt for Reranking.
We design the hybrid prompt specifically for reranking. In the hybrid prompt, instead of using a prompt encoder to generate a continuous vector for Q, we express it in natural language: "Do these two sentences match?". The reason is that, in the template of the continuous prompt, text matching is converted into predicting the word at [MASK]: if the result is "yes", the two texts match, otherwise they do not. To predict the word, the parameters of the three prompt encoders for P1, P2, and Q need to be updated. However, there is no prior information about the task other than the training data. As a result, although the prompt encoders can minimize the loss during training, the task they describe may not be text matching; the prompts merely control the PLM to extract features from the input texts and output the corresponding "yes" or "no", which is not conducive to generalizing to other datasets. Expressing Q as task-relevant natural language keeps the task closer to text matching and helps P1 and P2 describe the different matching tasks, which is beneficial for improving generalization ability. The template contains both a manual prompt (Q) and continuous prompts (P1, P2); we call this method the hybrid prompt.

Essential Matching Module
The essential matching module (EMM) is used to capture essential matching signals across tasks and combine them with the task descriptions obtained in MDM to adapt to different tasks via multi-task learning on mixed datasets consisting of multiple tasks. In this paper, we propose that various IR tasks have shared matching signals such as exact matching, semantic matching, and inference matching, which we call essential matching signals. The traditional end-to-end fine-tuning paradigm makes the model overemphasize the task-specific signals and domain biases but lose the ability to capture essential matching signals that can be used across different matching tasks and domains for IR. The architecture of the essential matching module for dense retrieval (a) and reranking (b) is shown in Figure 5; it consists of the trainable PLM and the task prompt tokens obtained from the matching description module. Multi-task learning can capture the information shared across tasks [38]. In EMM, the PLM is trained on mixed datasets consisting of multiple high-diversity tasks to avoid overfitting the domain and task biases and thus capture the essential matching signals. During training, the task prompt tokens guide the learning and combining of essential matching signals to adapt to different tasks. In inference for different tasks, the task prompt tokens are added to the input texts to control EMM to perform the specific task.

Training
Because of the different structures of the dual encoder in dense retrieval and the cross encoder in reranking, we use in-batch contrastive loss and cross entropy as the loss functions for the training of Retriever-Prompt and Reranker-Prompt respectively. The loss function for the training of MDM and EMM in Retriever-Prompt is the in-batch contrastive loss. Specifically, given the queries Q and their corresponding positive documents D in a mini-batch during training, the negative documents for a query q ∈ Q are the positive documents of the other queries in D. Given B = {q, d+, d−,1, d−,2, ..., d−,n}, the input texts to the PLM are constructed following the retrieval template above, with the prompt tokens produced by the prompt encoders. The representation of a text is the embedding h_[MASK] obtained from the output hidden states of the PLM. The similarity between q and d can be defined as: sim(q, d) = h_[MASK](q) · h_[MASK](d). The loss function for B is: L_B = −log( exp(sim(q, d+)) / ( exp(sim(q, d+)) + Σ_{j=1}^{n} exp(sim(q, d−,j)) ) ), and the total loss function is: L = (1/m) Σ_{i=1}^{m} L_{B_i}, where m is the number of queries. The optimization of the trainable parameters θ is: θ* = argmin_θ L. It is worth noting that in the training of the matching description module, the trainable component is the prompt encoder and the PLM is frozen; in the training of the essential matching module, the trainable component is the PLM and the prompt encoder is frozen. The loss function for the training of MDM and EMM in Reranker-Prompt is the cross entropy loss. Specifically, let M be the pre-trained language model with vocabulary V, let Y be the label set of the task, and let y denote a label. We design a verbalizer that maps each label to a word in the vocabulary, denoted v(y). If [MASK] is filled by "yes", the label is 1; if it is filled by "no", the label is 0. P_M(w | x) denotes the probability that M fills [MASK] with w ∈ V. The relationship between x and y is modeled as: p(y | x) = P_M(v(y) | x) / Σ_{y′ ∈ Y} P_M(v(y′) | x).
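The in-batch contrastive loss can be sketched numerically: with m queries and their m positive documents in a batch, the similarity matrix is m×m, the diagonal holds the positive pairs, and each row is softmax-normalized. The toy embeddings below are assumptions standing in for the [MASK]-token representations.

```python
import numpy as np

def in_batch_contrastive_loss(q_embs, d_embs):
    """In-batch contrastive loss: for query i, document i is its positive
    and the other documents in the batch serve as negatives.
    q_embs, d_embs: (m, dim) arrays of [MASK]-token embeddings."""
    sims = q_embs @ d_embs.T                         # (m, m) inner-product similarities
    sims = sims - sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # -(1/m) * sum_i log p(d_i+ | q_i)

q = np.eye(3) * 3.0
loss_aligned = in_batch_contrastive_loss(q, q)               # positives score highest
loss_uniform = in_batch_contrastive_loss(np.eye(3), np.ones((3, 3)))  # no signal
```

When every positive pair dominates its row the loss is near zero; when all documents look alike it approaches log m, which is why reusing in-batch positives as negatives trains the encoders without explicit negative mining.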
The loss function of our method is the cross entropy: L = −Σ_x log p(y | x). As in the training for retrieval, in the training of the matching description module the trainable component is the prompt encoder and the PLM is frozen, while in the training of the essential matching module the trainable component is the PLM and the prompt encoder is frozen.
After training, we obtain the matching function F, and the matching process can be simplified to: y = F(T1, T2), y ∈ {0, 1}, which indicates whether T1 and T2 match. In ranking tasks such as DR, QA and RD, we use the probability that the word at [MASK] is predicted to be "yes" minus the probability that it is predicted to be "no" as the relevance between the texts: rel(T1, T2) = p("yes" | x) − p("no" | x).
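Turning the [MASK] prediction into a ranking score can be sketched as follows: softmax the vocabulary logits at the [MASK] position and subtract the "no" probability from the "yes" probability. The toy logits and the vocabulary ids are hypothetical, not BERT's actual token ids.

```python
import math

def relevance_from_mask_logits(logits, yes_id, no_id):
    """rel = p('yes') - p('no') under the softmax over the PLM's
    vocabulary logits at the [MASK] position. Returns a score in [-1, 1]."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return exps[yes_id] / z - exps[no_id] / z

# "yes" logit higher than "no" -> positive relevance
rel = relevance_from_mask_logits([2.0, 0.5, -1.0], yes_id=0, no_id=1)
```

Because the score is a signed difference of probabilities, candidates can be sorted by it directly, with ties at zero when the model is indifferent between "yes" and "no".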

EXPERIMENTS
In this section, we introduce the experiment settings and the performance of NIR-Prompt on the retrieval stage, reranking stage, and entire neural information retrieval pipeline.

Research Questions.
To evaluate the generalization ability of NIR models at a fine-grained level, we define multi-task generalization at three levels. Figure 6 describes the three levels in detail, which indicate generalization across datasets, domains, and tasks. They can be evaluated by in-domain multi-task performance, out-of-domain multi-task performance, and new task adaptation respectively. Based on these levels, we propose three corresponding research questions on the generalization of NIR models, and the following experiments will answer these questions.
• In-domain Multi-task Performance: Can the information of different retrieval tasks be shared and facilitate each task? Train: given mixed datasets of multiple tasks, the model performs multi-task training on them.
The first part is the six datasets selected from five tasks including document retrieval (DR), open-domain question answering (QA), retrieval-based dialogue (RD), paraphrase identification (PI), and natural language inference (NLI) (shown in Table 4 and Table 5). This part is used to train the NIR-Prompt model and to evaluate the in-domain multi-task performance and new task adaptation. The second part is the eleven datasets selected from the same tasks as the first part, which are used to evaluate the out-of-domain multi-task performance. The third part is BEIR [68], a public benchmark to evaluate the zero-shot ability of retrieval models. As for the tasks in the retrieval and reranking stages, the former considers the information retrieval tasks including document retrieval (DR), open-domain question answering (QA), retrieval-based dialogue (RD), and BEIR; the latter considers these information retrieval tasks as well as the sentence-pair classification tasks including paraphrase identification (PI) and natural language inference (NLI). The sentence-pair classification setting of PI and NLI does not fit the in-batch training method of the dual-encoder architecture in the retrieval stage. In addition, mainstream studies [3,27,52] perform PI and NLI based on the cross-encoder architecture in reranking, so we follow them and only use PI and NLI in reranking.

Implementation Details.
We use the BERT-base (109M) model as the pre-trained model in the experiments. In the training of EMM and MDM, for retrieval we use the Adam optimizer with learning rate 10⁻⁵ and batch size 32; for reranking, the batch size is 15. We train the models for 20 epochs and evaluate the performance after each epoch; when there is no improvement for 3 consecutive epochs, training stops early. The lengths of the continuous prompts are 6, 6, 5, and 5 for the four prompt segments respectively. As for the data arrangement in the mixed datasets, for retrieval the data in a batch comes from the same task because we use the in-batch contrastive loss in training, and the batches of each task are arranged alternately; for reranking, the amount of data from each task in a batch is balanced. The training and inference are performed on a Tesla V100 32GB GPU.
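The task-alternating batch arrangement used for retrieval can be sketched as below; the function name and round-robin details are illustrative assumptions consistent with the description above:

```python
import random

def alternating_task_batches(datasets, batch_size, seed=0):
    """Arrange mixed-task training data for the retriever: every batch
    is drawn from a single task (required by the in-batch contrastive
    loss) and the tasks' batches alternate.

    datasets: dict mapping task name -> list of examples.
    Yields (task_name, batch) pairs.
    """
    rng = random.Random(seed)
    queues = {}
    for task, data in datasets.items():
        data = data[:]
        rng.shuffle(data)
        queues[task] = [data[i:i + batch_size]
                        for i in range(0, len(data), batch_size)]
    # Round-robin over tasks until every task's batches are exhausted.
    while any(queues.values()):
        for task in list(queues):
            if queues[task]:
                yield task, queues[task].pop(0)
```

For reranking, a task-balanced sampler would instead draw an equal share of each task's examples into every batch.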
• Training and evaluation for retrieval. For QA, we construct question-answer pairs from the training set as described in DPR [28]. For RD, we take two adjacent texts in a dialogue as a positive pair. For DR, we use the labeled sample pairs in the dataset as positive samples. The negative samples for these three tasks are obtained by in-batch sampling, i.e., the positive samples of other queries. In evaluation, for each query of QA and DR, we use the set of labeled positive and negative samples in the dataset as its candidate list. For each query of RD, we sample 50 candidates as its candidate list.
In evaluation, the candidate document set for each query is the entire corpus.
• Training and evaluation for reranking. For QA, RD, and DR, the construction method of positive samples is the same as that for retrieval. As for the negative samples, for QA and DR they are obtained from the labeled text pairs in the datasets; for RD, they are randomly sampled from the corpus. For PI, the sizes of public datasets are small, so we select two datasets, MSRP and QQP. For NLI, we convert the task into a binary classification of whether the hypothesis can be inferred from the premise.
In evaluation, we use BM25 to retrieve candidates for each query and use the reranking model to rerank them. Specifically for DR, the candidate document set for each query is the labeled documents obtained from the dataset, which is much smaller than the entire corpus.
• Evaluation for the information retrieval pipeline. For the evaluation of the neural information retrieval pipeline, we use NIR-Prompt and the traditional fine-tuning paradigm to retrieve the candidate document set from the entire corpus and to rerank it respectively.
4.1.5 Baselines and Measures. Our baselines consist of task-specific and multi-task models for dense retrieval and reranking trained with the traditional fine-tuning paradigm.
The task-specific models (i.e., Fine-tuning) are trained on the dataset corresponding to each task listed in Table 4 and Table 5 and tested on the corresponding task. For dense retrieval, the implementation details of Fine-tuning are consistent with DPR; for reranking, they are consistent with MonoBERT. DPR uses the [CLS] token of BERT to represent the query and the document as dense vectors and uses the in-batch contrastive loss as the learning objective. MonoBERT concatenates the query and the document as the input and feeds the embedding of [CLS] into a feed-forward network to judge the relevance. Both are based on the traditional fine-tuning paradigm of PLMs. In this paper, we only discuss the advantages of NIR-Prompt over the fine-tuning paradigm. Further optimization techniques such as hard negative sampling and distillation are not specifically discussed, because they can also be applied to our method.
In order to capture the essential matching signals, our method needs to be trained on a mixture of multiple datasets. Therefore, for a fair comparison, we also apply multi-task training to the traditional fine-tuning method on multiple datasets. Specifically, for the fine-tuning methods in dense retrieval and reranking (DPR and MonoBERT), we introduce two multi-task training variants.

Fine-tuning, which uses the traditional fine-tuning paradigm to train the PLM on the mixed datasets without any task-specific marks, and Fine-tuning, which adds task-specific marks to the input text as task differentiation in multi-task training. Besides, the following multi-task training methods specifically designed for reranking are worth considering. MT-DNN [38] adds a task-specific feed-forward network for each task, and we reproduce it on our mixed datasets; this method introduces additional parameters, whose number grows with the number of tasks, whereas NIR-Prompt does not need any task-specific layers during inference. There is also a family of multi-task learning frameworks that convert each task into a unified question-answering format [8,45,53,55]. We reproduce this framework on our mixed datasets using T5-base (220M) and call it MTL. Since T5 has been pre-trained on multiple supervised downstream tasks that will be tested in our experiment, which would be unfair for comparison, we choose T5 1.1, which is pre-trained only on unsupervised data. BM25 [57] also has strong multi-task generalization ability, and we use it as one of the baselines. We also use parameter-efficient multi-task training and few-shot learning methods such as HyperFormer [27], ATTEMPT [3], and AdapterFusion [52] as baselines, reproduced on our datasets. In multi-task training there are tricks concerning data sampling, loss construction, and task scheduling, which can be used by both the baselines and NIR-Prompt, so we do not compare them in detail.
In this paper, we want to show that an NIR model trained with NIR-Prompt (based on the idea of decoupling the process of signal capturing and signal combination) attains better multi-task generalization than the traditional fine-tuning paradigm. Some methods [10,16,17,32,42,47,76,80] are not given special consideration because they are variants of this fine-tuning paradigm, which is inconsistent with our motivation, and they can also be incorporated into our method. We train RetroMAE [77] (a state-of-the-art IR model for both in-domain and out-of-domain generalization) with our method and with the traditional fine-tuning paradigm respectively on the mixed datasets to show the compatibility of our method with SOTA IR models.
For our method, Retriever-Prompt and Retriever-Prompt correspond to the manual and continuous prompts for dense retrieval; Reranker-Prompt, Reranker-Prompt, and Reranker-Prompt correspond to the manual, continuous, and hybrid prompts for reranking.
As for the evaluation metrics, we use Accuracy and F1-score to evaluate NLI and PI [11]. Accuracy measures the proportion of samples for which the reranking model correctly classifies the matching relationship between two texts. F1-score combines recall (the coverage of positive samples) and precision (the correctness of predicted positives) and reflects the robustness of the model. QA and RD usually return only the first-ranked sample to the user, so P@1 and MRR are suitable for them [73]. P@1 measures the proportion of queries for which the correct answer is ranked first. MRR measures the position of the first correct answer in the ranked list returned by the model. For DR, we use NDCG [41], which accounts for both the relevance of documents and their ranking positions in the returned list.
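For concreteness, the ranking metrics can be computed as in this sketch over binary relevance labels ordered by the model's ranking; the function names are ours, and the NDCG variant shown uses the standard log2 discount:

```python
import math

def precision_at_1(ranked_labels):
    """P@1: 1 if the top-ranked candidate is relevant, else 0."""
    return float(ranked_labels[0] == 1)

def mrr(ranked_labels):
    """Reciprocal rank of the first relevant candidate (0 if none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

def ndcg(ranked_gains, k=10):
    """NDCG@k: discounted cumulative gain normalized by the ideal ordering."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(ranked_gains[:k]))
    ideal = sorted(ranked_gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR and P@1 coincide when the first relevant item is at rank 1; NDCG additionally rewards placing relevant items higher throughout the list.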

4.2.1 In-domain Multi-task Performance. We perform multi-task mixed training for retrieval and reranking on the datasets in Table 4 and Table 5 respectively and test the performance on the corresponding test sets. The results are shown in Tables 7(a), 7(b), and 7(c). For retrieval, reranking, and the entire neural information retrieval pipeline, NIR-Prompt performs better than the traditional fine-tuning paradigm. Comparing the multi-task models with the task-specific model (Fine-tuning), the performance of the fine-tuning paradigm drops significantly while our method surpasses it, which shows that in the in-domain setting the traditional end-to-end fine-tuning paradigm cannot effectively utilize the information shared between tasks to promote each task; instead the tasks interfere with each other, and our method solves this problem. The result answers the first research question: NIR-Prompt can capture the information shared between tasks and exploit it so that the tasks facilitate each other. This is mainly due to our idea of decoupling the process of signal capturing and signal combination, which helps the multi-task model distinguish and adapt to different tasks.

4.2.2 Out-of-domain Multi-task Performance. We test the retrieval and reranking models trained on Table 4 and Table 5 on their unseen datasets (Part 2 of Table 6). The results shown in Tables 8(a), 8(b), and 8(c) indicate that NIR-Prompt has better out-of-domain multi-task performance than the traditional fine-tuning paradigm. Mixed training on highly diverse tasks keeps the model from overfitting dataset biases and lets it capture the essential matching signals that can be used across tasks and domains. The prompt tokens serve as task descriptions that guide the learning of essential matching signals during mixed training and adapt these signals to different tasks and domains, which is a more reasonable way to use the knowledge in the PLM. These factors improve the generalization performance of NIR-Prompt. As for the traditional fine-tuning paradigm, the performance of the multi-task model drops more seriously than that of the task-specific model Fine-tuning, which indicates that the multi-task model trained by the fine-tuning paradigm cannot distinguish tasks well, so the tasks interfere with each other. In contrast, NIR-Prompt has a stronger task-distinguishing ability and enables tasks to utilize shared information to promote each other. The result answers the second research question: NIR-Prompt can capture essential matching signals usable across domains and thereby improve out-of-domain generalization.

4.2.3 New Task Adaptation. Furthermore, we investigate the few-shot learning capability of the multi-task model when presented with new tasks. Specifically, following the leave-one-out method, for the set of three (or five) specific text matching tasks, we choose two (or four) of them for mixed training, which we refer to as the multi-task model; the remaining task is designated as the new task for the multi-task model. This allows us to assess the model's ability to capture the essential matching signals shared across tasks in multi-task learning and to adapt the learned signals to unseen tasks with only a limited number of examples. In NIR-Prompt, the prompt tokens of the new task are obtained by the weighted sum of the prompt tokens of the other four tasks, and the weight of each task is a trainable variable in few-shot learning. The w/o fusion variant does not fuse the prompts of the other tasks and trains directly from scratch under the few-shot setting. In order to better demonstrate the effect of other tasks on low-resource learning, we also use Fine-tuning to directly perform few-shot learning on each task. For the new task, we select 32 positive and 32 negative samples for training and observe the performance on the test set. The experimental results are shown in Tables 9(a), 9(b), and 9(c). The models based on multi-task training obtain better few-shot performance than Fine-tuning, and the multi-task model based on NIR-Prompt has better few-shot learning ability on new tasks than those based on other methods. This indicates that the information of other tasks can improve the adaptability of the model to new tasks, and NIR-Prompt performs best. In addition, the prompts of other tasks contain descriptions that adapt the essential matching signals to their tasks, and integrating them leads to a good initialization of the description for the new task, which further promotes few-shot learning. The result answers the third research question: the essential matching signals captured in multi-task learning can be adapted to new tasks with only a few examples.
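The prompt-fusion initialization for a new task can be sketched as a trainable convex combination of the learned task prompts; the softmax normalization here is our assumption for the sketch, not necessarily the paper's exact parameterization:

```python
import numpy as np

def fuse_new_task_prompt(task_prompts, weights):
    """Initialize the prompt tokens of a new task as the weighted sum of
    the learned prompt tokens of the other tasks. The weights are the
    trainable variables in few-shot adaptation; softmax keeps them a
    convex combination summing to 1.

    task_prompts: (n_tasks, prompt_len, dim) array of learned prompts.
    weights: (n_tasks,) raw (trainable) weight logits.
    Returns a (prompt_len, dim) prompt for the new task.
    """
    w = np.exp(weights - np.max(weights))   # stable softmax
    w = w / w.sum()
    return np.tensordot(w, task_prompts, axes=1)
```

With equal logits, the fused prompt is simply the mean of the source prompts; few-shot gradient updates then shift the weights toward the most related tasks.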
, Vol. 1, No. 1, Article. Publication date: December 2023.
Table 7. In-domain multi-task performance of the models trained on the mixed datasets listed in Table 4 for dense retrieval and reranking. Boldface indicates the best results among multi-task models, and results better than Fine-tuning are denoted by '*'. Results with significant improvement (p-value ≤ 0.05) compared with all multi-task baselines are denoted by '+'.
(a) In-domain retrieval

Performance on BEIR.
We also evaluate NIR-Prompt and the traditional fine-tuning paradigm on BEIR, a heterogeneous benchmark for testing the generalization ability of retrieval models. In this experiment, we test the generalization performance of dense retrieval, reranking, and the neural information retrieval pipeline on BEIR respectively. The models used for evaluation are trained on Table 4 for dense retrieval and Table 5 for reranking, consistent with Sections 4.2.1 and 4.2.2. The results in Table 10 indicate that NIR-Prompt performs better than the traditional fine-tuning paradigm in dense retrieval, reranking, and the full pipeline, which further demonstrates the generalization ability of our method.

Model Analysis
The quality of the learned prompt tokens is very important for the multi-task generalization of text matching models. We first measure the specificity of the information stored in the prompt tokens by evaluating on a handcrafted 'QA vs. PI' task. Then, we explore the connections between task prompt tokens and their composability.

4.3.1 Ability to Distinguish Tasks. We construct handcrafted 'QA vs. PI' datasets to verify this ability; the examples come from the given QA datasets (TREC, WQ, and NQ). For each question, we provide two candidate texts: one is the passage containing the answer (a positive sample in the original QA datasets), and the other is the question itself. Since the copied question is exactly the same as the question, it cannot be the answer to the question and can only reflect the matching relation of the PI task. In contrast, the passage contains the answer to the question and reflects the matching relation of the QA task. Due to the extra answer information in the passage, its matching score under the PI task should be lower than that of the copied question, which is identical to the query. Therefore, by altering the task marks between QA and PI and comparing the matching scores of the two candidates, we can evaluate the ability of the multi-task model to distinguish tasks. Table 11 indicates that NIR-Prompt improves the ability of the multi-task model to distinguish different tasks. Even though the other multi-task learning methods distinguish tasks by different means and are trained on datasets of multiple tasks, they still tend toward the exact matching signals and cannot distinguish tasks well. This is also the key reason why NIR-Prompt works better.
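The diagnostic can be phrased as a small check: swap the task mark and compare the two candidates' scores. The helper below is a hypothetical harness around any scoring function, not the paper's code:

```python
def distinguishes_tasks(score, question, answer_passage):
    """'QA vs. PI' diagnostic: the two candidates for a question are the
    answer passage and the question itself. A model that distinguishes
    tasks should rank the passage first under the QA mark and the copied
    question first under the PI mark.

    score(task, text_a, text_b): the model's matching score; here a
    stand-in for the real reranker.
    """
    qa_ok = score("QA", question, answer_passage) > score("QA", question, question)
    pi_ok = score("PI", question, question) > score("PI", question, answer_passage)
    return qa_ok and pi_ok
```

A model that always prefers exact overlap fails the QA half of the check, which is exactly the failure mode Table 11 measures.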

4.3.2 Relationships between Tasks. We explore the relationships between tasks using the cosine similarity of the task prompt token embeddings learned in Section 4.2.1; the heatmap is shown in Figure 7(a). We can see that NLI is similar to PI because both focus on exact matching signals, and DR is similar to QA because both exact and semantic matching signals are important in these tasks. The similarity between tasks is generally consistent with our prior knowledge about them. The heatmap of the fusion weights of tasks for each new task obtained in Section 4.2.3 is shown in Figure 7(b). This weight distribution is consistent with the embedding similarity distribution of the prompt tokens between tasks, which further supports the rationality of fusing different tasks for new task adaptation.
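The similarity analysis behind Figure 7(a) amounts to cosine similarity between the tasks' flattened prompt embeddings with a zeroed diagonal; a numpy sketch (the figure additionally row-normalizes the values, which is omitted here):

```python
import numpy as np

def prompt_similarity_matrix(prompts):
    """Cosine similarity between the tasks' learned prompt-token
    embeddings (each task's tokens flattened into one vector), with
    the diagonal zeroed as in Figure 7(a).

    prompts: (n_tasks, prompt_len, dim) array.
    Returns an (n_tasks, n_tasks) similarity matrix.
    """
    flat = prompts.reshape(len(prompts), -1)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    return sim
```

Tasks whose prompts point in similar directions (e.g. NLI and PI in the paper's analysis) yield entries near 1, while unrelated tasks yield entries near 0.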

4.3.3 Compatibility with SOTA IR Models. Our method is a general training framework that can be combined with existing IR models to further improve their generalization. In this section, we explore the effectiveness of our method based on RetroMAE [77], a state-of-the-art IR model for both in-domain and out-of-domain generalization that has been pre-trained on a large self-supervised corpus. Specifically, we train RetroMAE with the traditional fine-tuning method and with our framework respectively on the mixed datasets and compare their performance.
Table 9. New-task adaptation performance of the models. Each task in this table is a new task not included in the mixed datasets for multi-task training (leave-one-out). Results with significant improvement (p-value ≤ 0.05) compared with all multi-task baselines are denoted by '+'.
(a) New-task adaptation retrieval

In this section, we analyze the impact on performance of different values of l, the number of layers in which the prompt tokens are kept fixed. Figures 8 and 9 show how performance varies with l on the tasks in Tables 4 and 5 in the retrieval and reranking stages respectively. When l is 0, the values of the prompt tokens change with the self-attention computation in each layer, as in previous prompt engineering methods. When l is greater than 0, the values of the prompt tokens are fixed in layers 1 to l. Figures 8 and 9 show that fixing the prompt tokens is always better than or equal to the previous methods. Based on the experimental results across tasks, the performance is best when l is 11.
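Fixing the prompt tokens in the first l layers can be pictured as overwriting the prompt positions after each of those layers so that self-attention cannot update them. This is a conceptual numpy sketch with a mock layer stack, not the actual BERT modification:

```python
import numpy as np

def forward_with_fixed_prompts(hidden, prompt_emb, layers, l):
    """Forward pass with prompt tokens fixed in the first l layers: in
    layers 1..l the hidden states at the prompt positions are reset to
    the frozen prompt embeddings after each layer; from layer l+1 on
    they are updated like ordinary tokens (l = 0 recovers standard
    prompting).

    hidden: (seq_len, dim) input states; the first prompt_emb.shape[0]
            positions are assumed to hold the prompt tokens.
    layers: list of layer functions (stand-ins for transformer layers).
    """
    n_prompt = prompt_emb.shape[0]
    for i, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        if i <= l:
            hidden[:n_prompt] = prompt_emb   # freeze prompt positions
    return hidden
```

In the real model the reset happens on the hidden states fed into the next transformer layer; the sketch only illustrates the control flow.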

Effect of the Arrangement of Samples.
In this section, we explore the effect of the arrangement of samples in the mixed datasets consisting of multiple tasks. Specifically, we compare four arrangement strategies: Random Arrangement (randomly shuffle the mixed datasets), Task Batching Alternately (all data in a batch comes from the same task and the tasks' batches alternate), Task Batching Randomly (all data in a batch comes from the same task and the task of each batch is chosen randomly), and Task-balanced Batching (the data in a batch is sampled in balance from each task). We report the average performance of the models trained under these strategies on each dataset of the in-domain and out-of-domain settings.
The results are shown in Figure 13. In Retriever-Prompt, the loss function is the in-batch contrastive loss, so ensuring that the data in a batch comes from the same task helps the model learn the difference between positive and negative samples for that task. In Reranker-Prompt, the loss function is the cross-entropy; maintaining the balance between tasks in a batch helps the model balance the tasks during training and learn essential matching signals across tasks.

RELATED WORK
In this section, we review previous studies on neural information retrieval methods based on text matching. We also introduce prior work on prompt learning, because the core of the matching description module in our method is to obtain descriptions of different matching tasks to guide the learning and combination of essential matching signals: prompt learning is the paradigm of exploiting the knowledge in a PLM to complete a task, and the templates constructed for the input texts can reflect the description of the task in the PLM. As for capturing essential matching signals, multi-task learning is an effective way to capture signals shared between various tasks, and we apply it in the essential matching module of our method; we therefore review previous studies on multi-task training as well. Finally, we introduce recent studies on the out-of-domain generalization ability of neural information retrieval models.

Neural Information Retrieval Based on Text Matching
In this subsection, we review related studies on the two stages of the neural information retrieval pipeline (i.e., retrieval and reranking) from the perspective of the three essential matching signals: exact matching, semantic matching, and inference matching.

Neural Retrieval Based on Text Matching. Neural retrieval is the first stage of the information retrieval system that efficiently and accurately obtains candidate subsets from a massive document base. In exact matching, DrQA [9] combines TF-IDF-weighted bag-of-words vectors and bigrams to represent the text; this method shows good performance in open-domain QA. In semantic matching, models based on a single semantic text representation such as DSSM [65], CDSSM [62], and ARC-I [63] represent a text as a dense vector to obtain the matching score; these models ensure retrieval efficiency and are often used in DR and QA. More recently, PLMs have achieved state-of-the-art neural retrieval performance and become the mainstream neural retrieval method that can be used to represent different matching signals after training on in-domain datasets. DPR [28] uses BERT to encode queries and passages into high-dimensional vectors and trains the model with contrastive learning to obtain a dense retriever. ANCE [80] reveals the effect of hard negatives in the training of dense retrievers and proposes refreshing the corpus index during training to retrieve hard negatives for the queries in the training set. RocketQA also optimizes the utilization of samples during the training of dense retrieval at three levels: expanding the batch size, data augmentation, and denoised negative sampling. Methods such as Condenser [17] propose unsupervised pre-training to further enhance the representation ability of the PLM. AR2 [88] uses the reranker as a discriminator and exploits adversarial training to improve dense retrieval. TAS-B [22] distills knowledge from the reranker into the dense retriever and boosts performance.

Neural Reranking Based on Text Matching. Reranking is the finer ranking stage for the candidate subsets retrieved in the first stage, and it models the interaction between texts in a more complex way. In exact matching, DRMM [19] considers query term importance and is suitable for DR. In semantic matching, interaction-based text matching models such as DeepMatch [40], ARC-II [23], MatchPyramid [48], ESIM [11], and Match-SRNN [73] can describe the fine-grained semantic matching relationship between two texts; they are often used in short-text matching tasks such as PI, NLI, and RD. For inference matching, which requires models to infer new information from the text, asymmetric neural text matching models such as DeepRank [49] and RE2 [84] are suitable. Recently, PLM-based frameworks such as MonoBERT [10], which concatenates two texts and uses the self-attention mechanism for deep interaction, achieve state-of-the-art performance in reranking. Previous neural information retrieval models are only suitable for one specific task according to their specifically designed structures and mechanisms. Even though PLMs can be applied to multiple text matching tasks, only task-specific models fine-tuned on specific tasks achieve good performance; when the traditional fine-tuning method is used to train PLMs on mixed datasets of multiple matching tasks, the performance drops seriously [15]. Our approach focuses on capturing the essential information that can be used across tasks and domains and adapting it to different tasks and domains to improve the generalization ability of NIR models.

Multi-Task Learning for Natural Language Processing Tasks
In this section, we review multi-task learning for natural language processing tasks from the perspectives of traditional multi-task learning and multi-task prompt learning.

Traditional Multi-Task Learning. To enable a single model to handle multiple tasks effectively, a common strategy is to train it on mixed datasets that encompass various tasks. One straightforward approach is to directly combine the datasets of each task without incorporating any task-specific marks and fine-tune pre-trained language models (PLMs) on the mixture. For example, Talmor and Berant [67] propose MultiQA: they train models on multiple datasets without task-specific tokens and find that this leads to robust generalization in reading comprehension tasks. Another approach, introduced by Maillard et al. [43], is a multi-task retrieval model designed for knowledge-intensive tasks; they utilize the mixed datasets provided by KILT [50] without any task-specific tokens. Some approaches include additional task-specific components with parameters tailored to each task: Liu et al. [38] propose MT-DNN, which employs a shared transformer encoder to encode multiple natural language understanding (NLU) tasks and incorporates task-specific layers for each task. Additionally, there are methods that transform multiple tasks into a unified question-answering format, such as MQAN [45], UnifiedQA [29], and T5 [55]. Li et al. [34] apply this method to information extraction, while Chai et al. [8] and Puri et al. [53] use it for text classification.
The main focus of the above methods is not information retrieval. Furthermore, some of them fail to utilize differentiating marks, while others introduce additional task-specific layers or require extremely large models. In comparison, NIR-Prompt leverages multi-task training to capture essential signals that apply across tasks and domains and combines these signals to adapt to different tasks and domains, thereby improving the multi-task generalization capability of PLMs in neural information retrieval. Notably, when using the NIR-Prompt model for prediction, there is no need to add any task-specific layers.

Multi-task Prompt Learning. Recently, there has been an increasing trend of utilizing parameter-efficient methods for multi-task learning and transfer learning. The authors of [72] conduct an extensive study of transferability across various NLP tasks, and [66] investigates the transferability of prompts across different tasks. PANDA [89] proposes a metric to accurately predict prompt transferability and a novel method to transfer knowledge from a source to a target prompt via knowledge distillation. AdapterFusion [52] proposes a two-stage learning algorithm that leverages knowledge from multiple tasks. HyperFormer [27] learns adapter parameters for all layers and tasks by generating them with shared hypernetworks. ATTEMPT [3] exploits attentional mixtures of soft prompts for parameter-efficient multi-task learning.

Prompt Engineering
Prompt learning is an innovative tuning strategy for pre-trained language models (PLMs) that converts multiple natural language processing (NLP) tasks into [MASK] prediction format using language models.This is achieved by incorporating a template into the input texts.The creation of an effective template is crucial for successful prompt learning.Generally, there are two main categories for template creation: manual template engineering and automated template learning [37].
In manual template engineering, Petroni et al. [51] design templates to extract knowledge from PLMs.Schick et al. [59] propose Pattern-Exploiting Training (PET), which combines few-shot learning with templates and transforms specific tasks into cloze tasks.Although manual template creation can address many tasks, it has certain limitations: 1) it requires extensive knowledge and expertise to design templates, and 2) manually created templates may not always be globally optimal [35].To overcome these challenges, several automated template learning methods have been proposed, such as token-based gradient searching [64], mining training corpus [24], Prefix Tuning [35], and P-tuning [39].Recently, there have been studies focusing on pre-training prompts on multiple tasks, such as SPoT [71], PPT [18], ExT5 [2], FLAN [74], ZeroPrompt [81], and zero-shot for task generation [58].Different from previous research, our work does not primarily aim to improve the performance of prompt learning.Instead, our focus is on utilizing prompt learning to enhance the generalization capability of neural information retrieval models.Specifically, we employ prompt learning to separate the process of capturing signals and combining them in text matching.The key role of prompt learning in our method lies in obtaining task descriptions for various matching tasks within PLMs and utilizing these descriptions to combine essential matching signals for different tasks and domains.Additionally, we explore the relationships among task prompt tokens and discover that new matching task prompts can be constructed by linearly combining other learned task prompt tokens.

Out-of-domain Generalization of Neural Information Retrieval
The generalization of neural information retrieval models across domains (i.e., the out-of-domain multi-task performance in our research questions) has recently received attention from researchers. In general, the out-of-domain generalization of neural information retrieval is poor, even weaker than traditional word-overlap-based methods such as BM25 [57]. A thorough examination of the generalization of dense retrieval [56] discusses the key factors that affect zero-shot dense retrieval performance, including vocabulary overlap, query type distribution, and data scale. BEIR [68], a heterogeneous benchmark for testing the generalization ability of retrieval models, was designed to promote this line of research. MoDIR [78] introduces a momentum method to learn domain-invariant representations by adversarial training on the source and target domains. DDR [87] disentangles the retrieval model into a Relevance Estimation Module (REM) for modeling domain-invariant matching patterns and several Domain Adaptation Modules (DAMs) for modeling domain-specific features of multiple target corpora, yielding an adaptable dense retrieval framework. However, both of these methods require data from the target domain during training; they perform unsupervised domain adaptation rather than generalization, which is inconsistent with our setting (no target-domain data required).

CONCLUSION
In this paper, we point out that although there are differences among the various information retrieval tasks, essential matching signals are shared across them, such as exact matching, semantic matching, and inference matching. If a model can capture and exploit these signals, its generalization ability across tasks and domains improves. With this intuition, we propose a neural information retrieval training framework called NIR-Prompt, consisting of an Essential Matching Module (EMM) and a Matching Description Module (MDM), based on the idea of decoupling the process of signal capturing and signal combination. MDM uses prompt learning to obtain the descriptions of different tasks in the pre-trained language model. EMM is trained on diverse mixed datasets and, guided by the task descriptions from MDM, captures essential matching signals and adapts them to different tasks. Based on this, a generalized neural information retrieval pipeline consisting of retrieval and reranking is constructed. The experimental results on eighteen public datasets and a heterogeneous benchmark for testing the generalization ability of retrieval models show that our method yields better in-domain multi-task, out-of-domain multi-task, and new task adaptation performance for dense retrieval, reranking, and the entire neural information retrieval pipeline compared to the traditional fine-tuning paradigm.

Fig. 1. Matching texts and signals for "What are the four phases of the cell cycle?" in different tasks.

Fig. 2. The basic idea of NIR-Prompt. The Matching Description Module (MDM) exploits prompt learning to map the characteristics of each task to prompt tokens and uses them as the task description. The Essential Matching Module (EMM) combines the description with the data of each task and mixes various tasks into mixed datasets. The PLM performs multi-task learning on the mixed datasets to implicitly capture the essential matching signals shared across tasks and adapt them to the description of each task. The task description combines the essential matching signals to generalize to multiple tasks.
[Manual prompt templates, partially recovered: Document Retrieval — "Does the passage include the content that matches the query?"; Question and Answer — "Question: … Passage: … Does the passage include the answer of the question?"; Retrieval-based Dialogue — "The first text: … The second text: … Can the second text reply to the first text?"; Paraphrase Identification — …]

Fig. 5. Essential Matching Module (EMM) in the retrieval and reranking stages. Task prompts are obtained from the prompt tokens of the Matching Description Module. EMM optimizes the Transformer to capture the essential matching signals and adapt to different tasks under the control of the prompt tokens.

4.1.2 Datasets. The datasets for the experiments, consisting of three parts, are shown in Table

Fig. 7. Task relationships shown by the prompt tokens. The sum of each row is 1, and values are rounded to two decimal places. The values in (a) represent the distribution of cosine similarity between the task in each row and the other tasks. The values in (b) represent the distribution of the fusion weights of the other tasks for each new task in the row (obtained by Section 4.2.3). There is no similarity calculation between a task and itself, so the diagonal elements are zero.
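The row-normalized cosine-similarity layout of Fig. 7(a) can be sketched as follows (a minimal NumPy version; it assumes one pooled prompt embedding per task and non-negative pairwise similarities, which are assumptions of this sketch rather than details stated in the paper):

```python
import numpy as np

def task_similarity_matrix(prompts: np.ndarray) -> np.ndarray:
    """Row-normalized cosine similarity between per-task prompt vectors.

    `prompts` is (n_tasks, dim): e.g. the mean-pooled prompt-token
    embedding of each task. The diagonal is zeroed (no self-similarity)
    and each row is rescaled to sum to 1, matching the layout of
    Fig. 7(a). Assumes similarities are non-negative.
    """
    unit = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    sim = unit @ unit.T               # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)        # no similarity between a task and itself
    return sim / sim.sum(axis=1, keepdims=True)

M = task_similarity_matrix(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))
```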

Fig. 8. Performance on various tasks varies with in retrieval.
We will introduce the Matching Description Module and the Essential Matching Module in Retriever-Prompt and Reranker-Prompt, respectively.

Table 1. Manual prompts in Retriever-Prompt. Handcrafted prompts are described in natural language.

Table 2. Manual prompts in Reranker-Prompt. Handcrafted prompts are described in natural language.
the training, task prompt tokens obtained from MDM contain the description of different tasks and

Table 3. Experiment results in each setting.

multi-task learning on the training set. Evaluate: Test its performance on the testing set of each dataset.
• Out-of-domain Multi-task Performance: Can our method capture generalizable essential matching signals and avoid fitting domain-specific biases? Train: Given mixed datasets of multiple tasks, the model performs multi-task learning on the training set. Evaluate: Test the zero-shot learning ability of the multi-task model on out-of-domain datasets.
• New Task Adaptation: Can essential matching signals facilitate new task adaptation in low-resource scenarios? Train: We use leave-one-out to perform few-shot learning on new tasks. Evaluate: Test its performance on the testing set of each dataset.

Table 4 .
Details of mixed training datasets for dense retrieval.

Table 5 .
Details of mixed training datasets for reranking.

Table 6. Datasets in experiments. Levels 1, 2, and 3 correspond to in-domain multi-task, out-of-domain multi-task, and new task adaptation.

Table 8. Out-of-domain performance of the model on unseen datasets listed in Part 2 of Table 6 for dense retrieval and reranking. Boldface indicates the best results of multi-task models, and results over Fine-tuning are denoted by '*'. Results with significant performance improvement (p-value ≤ 0.05) compared with all multi-task baseline models are denoted by '+'.

Table 10. Results on BEIR. FT is the traditional fine-tuning paradigm, RP1 is Retriever-Prompt, and RP2 is Reranker-Prompt. Boldface indicates the best results for each stage. Results with significant performance improvement (p-value ≤ 0.05) compared with the traditional fine-tuning baselines are denoted by '+'. The evaluation metric is NDCG@10.

Table 11. Illustrate the ability to distinguish tasks.

Prompt) to train RetroMAE on the mixed datasets in Table 4 respectively, and compare their in-domain and out-of-domain generalization ability on the datasets in Tables 4 and 6. The experimental results are shown in Table 12. Compared with the traditional fine-tuning method, our method can further improve the in-domain and out-of-domain ability of RetroMAE.

4.3.4 Analysis of Fixing the Prompt Tokens. In Section 3.3.2, we propose that, in the training of continuous prompt tokens, the embeddings of the trainable prompt tokens remain fixed in layers 1 to in the self-attention. The intention of this design is to enable the trained prompt tokens to
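The fixed-prompt design can be sketched as follows (a toy NumPy version with stand-in layers; the depth `k` and the reset-after-each-layer mechanics are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def encode_with_fixed_prompts(x, prompt, layers, k):
    """Run toy encoder layers, keeping prompt embeddings fixed in layers 1..k.

    `x` is (seq_len, dim) input embeddings and `prompt` is (p, dim)
    trained prompt-token embeddings. After each of the first `k` layers
    the prompt positions are reset to their trained values, so the task
    description stays stable early on; in later layers they update
    freely. Shapes and the layer interface are illustrative.
    """
    p = prompt.shape[0]
    h = np.concatenate([prompt, x], axis=0)
    for i, layer in enumerate(layers):
        h = layer(h)
        if i < k:
            h[:p] = prompt  # keep prompt tokens fixed in early layers
    return h

layers = [lambda h: h + 1.0] * 3          # stand-in for transformer layers
out = encode_with_fixed_prompts(np.ones((3, 4)), np.zeros((2, 4)), layers, k=2)
```

With these stand-in layers, the two prompt rows accumulate only the final layer's update, while the three text rows accumulate all three.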

Table 12. Comparison of in-domain and out-of-domain generalization ability between RetroMAE trained by the traditional fine-tuning method (RetroMAE) and by Retriever-Prompt (RetroMAE). Boldface indicates the best results of multi-task models, and results over RetroMAE are denoted by '*'. Results with significant performance improvement (p-value ≤ 0.05) compared with RetroMAE are denoted by '+'.

Table 13. Average performance of the models trained under different arrangement strategies.