Pre-training with Aspect-Content Text Mutual Prediction for Multi-Aspect Dense Retrieval

Grounded on pre-trained language models (PLMs), dense retrieval has been studied extensively on plain text. In contrast, there has been little research on retrieving data with multiple aspects using dense models. In scenarios such as product search, aspect information plays an essential role in relevance matching, e.g., category: Electronics, Computers, and Pet Supplies. A common way of leveraging aspect information for multi-aspect retrieval is to introduce an auxiliary classification objective, i.e., using item contents to predict the annotated value IDs of item aspects. However, by learning the value embeddings from scratch, this approach may not sufficiently capture the various semantic similarities between the values. To address this limitation, we leverage the aspect information as text strings rather than class IDs during pre-training so that their semantic similarities can be naturally captured by the PLMs. To facilitate effective retrieval with the aspect strings, we propose mutual prediction objectives between the text of the item aspect and content. In this way, our model makes more sufficient use of aspect information than conducting undifferentiated masked language modeling (MLM) on the concatenated text of aspects and content. Extensive experiments on two real-world datasets (product and mini-program search) show that our approach can outperform competitive baselines that either treat aspect values as classes or conduct the same MLM for aspect and content strings. Code and the related dataset will be available at https://github.com/sunxiaojie99/ATTEMPT.


INTRODUCTION
Dense retrieval models [9-11, 26, 28, 29] have achieved compelling performance with pre-trained language models (PLMs) [5, 24] as the backbone. Most studies on dense retrieval focus on unstructured data consisting of plain text, while little attention has been paid to structured item retrieval such as product and people search. In these scenarios, additional aspect information beyond the query or item content is critical for relevance matching, such as brand: Nike or affiliation: Stanford. However, little work has explored how to use such information effectively in dense retrieval models.
A typical way of leveraging aspect information for multi-aspect retrieval is to refine the item representations with an auxiliary aspect prediction objective [12]. Specifically, for each aspect of an item, the item content is used to predict its annotated value IDs during training. This approach has two major disadvantages: 1) It considers the values of an aspect as isolated classes and learns the embeddings of value IDs from scratch, ignoring their semantic relations. For example, among the category values, "Hunting & Fishing" is more related to "Sports & Outdoors" while unrelated to "Pet Supplies". However, such semantic relations may not be captured sufficiently if we treat them as independent classes. 2) It does not use query/item aspects such as category, brand, color, etc. during test time, which limits the potential retrieval gains. Although it may be costly to obtain query aspects during online service, item aspects can be extracted offline, and it is easy to also use them during inference if they are already used in training.
In this paper, we propose a method of pre-training with Aspect-contenT TExt Mutual PredicTion (ATTEMPT) to address the above limitations. Specifically, ATTEMPT leverages aspect values as text strings and concatenates them with the content using leading indicator tokens in between. For more effective retrieval, rather than simply conducting undifferentiated MLM on the concatenated aspect and content text, we specifically design an aspect-content mutual prediction objective. It keeps the entire aspect/content tokens and predicts the masked ones in the content/aspects. Also, to suit the scenario where the overhead of obtaining the query aspects online is high, we set the query aspect text to empty during inference. Our method has several advantages over the common approach: 1) In ATTEMPT, the text of an aspect value reuses the token embeddings from the powerful PLMs, so the semantic relations between values can be naturally captured. 2) Being concatenated with the content, the item aspects can also take effect in relevance matching during test time. 3) The aspect-content mutual prediction objective promotes sufficient interactions between the aspect and content at the token level, producing better item representations for retrieval, which is confirmed by extensive experimental results.
As far as we know, there are no suitable large-scale public datasets for multi-aspect retrieval. We construct such a dataset by crawling the item categories from their pages to complement the aspects in the Amazon ESCI dataset [19]. Our experiments on this refined dataset and a real-world commercial mini-program dataset show that ATTEMPT can significantly outperform competitive baselines that either predict the classes of aspect values or conduct the same MLM for aspect and content strings.

RELATED WORK
There are three threads of work related to our study. (1) Multi-aspect Retrieval. Some work exploited multi-aspect information to rank products or entities before PLMs appeared [1, 2, 21]. In the era of PLMs [6], there was limited research on multi-aspect retrieval until Kong et al. [12] first attempted it. They learn aspect embeddings by predicting their value IDs with item contents and fuse them to yield an item embedding. Later, Shan et al. [23] proposed a fine-tuning method that uses local aspect-level matching signals to enhance the global query-item embedding matching. (2) Multi-field Retrieval. How to effectively leverage multiple fields (e.g., title, body, etc.) in a document has been a long-standing research topic. The most famous method is BM25F [22]. Methods leveraging multiple fields have also been proposed both before and after PLMs appeared [3, 18, 27, 30]. Multi-field data are unstructured text in nature, and the essential issue is how to weigh the fields differently during matching. Aspects, unlike fields, usually have a fixed value set which is much smaller than the space of field text. Thus, their core challenges are different. (3) Pre-trained Models for Dense Retrieval. Many studies have explored promoting the capabilities of PLMs for dense retrieval, including introducing extra training objectives [4, 13, 15-17], special masking schemes [25], and model architecture changes [7], etc. Our method is grounded on the basic dual BERT encoders [5].
[Figure 1 example item — category: Clothing, Women, Shoes, Athletic, Running; brand: adidas; title: adidas Women Cloudfoam Pure Running Shoe White 6.5 US…]

METHODOLOGY

3.1 Preliminary
For a query q or a candidate item d, we represent the content text (e.g., query string, title, description) as t_c and the aspect text (e.g., values for brand, color, and category) as t_a. Assuming q or d has k aspects, t_a is further denoted as t_a^1, ..., t_a^k. Each aspect a_i (1 ≤ i ≤ k) has a finite vocabulary of aspect values, denoted as V_i. Previous work [12] incorporates aspect information by predicting the IDs corresponding to the annotated values of each aspect a_i within the space V_i. In contrast, we propose to pre-train the encoder by conducting mutual prediction between the text t_a and t_c.
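As a toy illustration of this contrast (the variable names and values here are our own, not from the paper's code), predicting value IDs treats values as isolated classes, while keeping values as text lets their tokens reuse PLM embeddings:

```python
# Toy sketch of the Section 3.1 notation; names/values are illustrative.
# An item d has content text t_c and k aspect texts t_a^1..t_a^k.
item = {
    "t_c": "adidas Women Cloudfoam Pure Running Shoe White 6.5 US",
    "t_a": {  # aspect name -> annotated value string
        "brand": "adidas",
        "color": "White",
        "category": "Sports & Outdoors",
    },
}

# Prior work [12]: map each value to an ID in the aspect vocabulary V_i
# and predict that ID from the content (values become isolated classes).
V_category = {"Sports & Outdoors": 0, "Hunting & Fishing": 1, "Pet Supplies": 2}
class_target = V_category[item["t_a"]["category"]]

# ATTEMPT: keep the value as a token string, so its tokens reuse PLM
# embeddings and related values stay semantically close.
text_target = item["t_a"]["category"].split()

print(class_target, text_target)
```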

ATTEMPT
To naturally model the semantic relationship between the various values of an aspect, we treat the aspect values as text strings and concatenate them with the content text. To sufficiently capture the interactions between item aspects and contents, we introduce mutual prediction objectives, as illustrated in Figure 1.
Encoder Input. To indicate different types of text segments, we prepend an indicator token [A_i] (1 ≤ i ≤ k) to each aspect text t_a^i and [CON] to the original content t_c, e.g., an encoder input is "[CLS] [A_1] t_a^1 ... [A_k] t_a^k [CON] t_c". When a query/item does not have certain aspect information, the corresponding aspect text will be empty. In this case, the indicator tokens can still learn some implicit representations of the query/item content. Note that during relevance matching, we always keep the query aspect text empty to suit the practical retrieval scenarios where the overhead of obtaining query aspects is high, and also to avoid potential semantic drift. Table 3 will show that the query-side indicator tokens [A_i] alone, learned during pre-training, are beneficial for retrieval. Since the other parts of ATTEMPT are exactly the same between q and d, we take d as an example for illustration.
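A minimal sketch of assembling this input, assuming indicator-token spellings like "[A1]" and "[CON]" (the exact token names are our assumption, not fixed by the paper):

```python
# Sketch of the encoder-input format described above: each aspect text
# gets its indicator token [Ai]; the content gets [CON]. Missing aspects
# keep only their indicator token.
def build_encoder_input(aspects, content):
    """aspects: ordered list of aspect strings ('' for a missing aspect)."""
    parts = ["[CLS]"]
    for i, aspect_text in enumerate(aspects, start=1):
        parts.append(f"[A{i}]")        # indicator token for aspect i
        if aspect_text:                 # missing aspect -> indicator only
            parts.append(aspect_text)
    parts.append("[CON]")               # indicator token for the content
    parts.append(content)
    return " ".join(parts)

# Item side: brand and category present, color missing.
print(build_encoder_input(
    ["adidas", "", "Clothing, Women, Shoes"],
    "adidas Women Cloudfoam Pure Running Shoe",
))
# Query side during relevance matching: all aspect texts kept empty.
print(build_encoder_input(["", "", ""], "women running shoes"))
```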

Content Masked Language Modeling (MLM).
To capture the interactions between the content tokens without any auxiliary information, ATTEMPT conducts MLM on the item content. It randomly masks tokens in the content text and predicts the masked tokens with the context-dependent representations encoded by Transformer layers [5]. The corresponding loss function is:

L_mlm = - Σ_{t ∈ m(t̃_c)} log p(t | t̃_c \ m(t̃_c)),    (1)

where t̃_c denotes the text produced by randomly masking some tokens in the text t_c, m(t̃_c) denotes the masked tokens, and t̃_c \ m(t̃_c) denotes the remaining tokens in t̃_c.
Aspect-to-Content MLM Prediction. We take the entire aspect text as context when predicting the masked tokens in the content text. Under this context, the prediction of masked content tokens has extra evidence for consideration and can act differently than content MLM alone. The aspect-to-content (a2c) loss L_a2c is:

L_a2c = - Σ_{t ∈ m(t̃_c)} log p(t | t_a ⊕ (t̃_c \ m(t̃_c))),    (2)

where ⊕ means concatenation. In particular, the leading tokens [A_i] (1 ≤ i ≤ k) and [CON] in the input will not be masked.

Content-to-Aspect MLM Prediction. The idea of content-to-aspect prediction is similar to the aspect classification in [12], both of which use the original content to predict the aspects. However, ATTEMPT predicts the masked words in the aspect text rather than the value classes (IDs), which encodes the aspect information in a softer manner. Specifically, the content-to-aspect (c2a) loss is:

L_c2a = - Σ_{t ∈ m(t̃_a)} log p(t | (t̃_a \ m(t̃_a)) ⊕ t_c).    (3)

Overall Learning Objective. By introducing L_a2c and L_c2a, ATTEMPT can incorporate the aspect information into the item representation sufficiently through bidirectional interactions. In summary, our overall pre-training objective is:

L = L_mlm + L_a2c + λ · L_c2a,    (4)

where λ is a hyper-parameter.
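The three objectives differ only in which tokens are masked and which serve as context. A simplified sketch (not the authors' implementation; indicator-token names are assumed):

```python
import random

# Simplified masking sketch; indicator tokens ([CLS], [A*], [CON]) are
# never masked, mirroring the constraint stated above.
INDICATORS = {"[CLS]", "[CON]", "[A1]", "[A2]"}

def mask_tokens(tokens, ratio, rng):
    """Return (masked copy, {position: gold token}) over non-indicator tokens."""
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if tok not in INDICATORS and rng.random() < ratio:
            masked.append("[MASK]")
            targets[pos] = tok          # gold token to predict
        else:
            masked.append(tok)
    return masked, targets

rng = random.Random(0)
aspect = ["[A1]", "adidas", "[A2]", "running", "shoes"]
content = ["[CON]", "cloudfoam", "pure", "running", "shoe", "white"]

# Content MLM: predict masked content tokens from the remaining content.
c_masked, c_targets = mask_tokens(content, 0.15, rng)
mlm_input = c_masked

# Aspect-to-content (a2c): the full, unmasked aspect text is extra
# context for predicting the masked content tokens.
a2c_input = aspect + c_masked

# Content-to-aspect (c2a): masked aspect text plus the full content as
# context, to predict the masked aspect tokens.
a_masked, a_targets = mask_tokens(aspect, 0.6, rng)
c2a_input = a_masked + content
```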

EXPERIMENTAL SETUP

4.1 Datasets
We conduct model comparisons on two real-world datasets.

Multi-Aspect Amazon ESCI Dataset (MA-Amazon). Amazon ESCI Product Search [19] originally has multilingual real-world queries, product information such as brand, color, title, description, etc., and 4-level relevance labels: Exact, Substitute, Complement, and Irrelevant. We only use the English part and enrich the dataset by collecting multi-level product categories from the item pages. We merge all the items and get a corpus of 482K unique items, which is used for pre-training. For fine-tuning, we divide the original training set into training and validation sets by queries, and keep the test set, yielding 17K, 3.5K, and 8.9K queries, respectively. As in [19], we treat Exact as relevant and the other labels as irrelevant during training and for recall calculation. MA-Amazon only has item aspect information, and the coverage of brand, color, and category levels 1-2-3-4 is 94%, 67%, and 87%-87%-85%-71%, respectively.

Alipay Search Dataset. Alipay is a mini-program (app-like service) search dataset with binary manual relevance annotations. The pre-training query/item corpus has 1.3M/1.8M distinct queries/items with aspect information, i.e., brand (44%/0.6% coverage on query/item) and three-level categories (91%-90%-56%/90%-90%-62% coverage for category levels 1-2-3 of query/item). The fine-tuning dataset consists of 60K/3.3K/3.3K unique queries in the training/validation/test sets. Note that the queries for validation and testing do not appear in the pre-training query corpus.

Baselines
We compare ATTEMPT with the following pre-training methods (-C means that the input takes the same concatenation strategy for aspect and content text as ATTEMPT): (1) BIBERT [14, 20]: BIBERT, the backbone of ATTEMPT, is a prevalent dense retrieval method for plain text. It employs MLM [5] to pre-train the encoder using the content text of query/item. (2) Condenser [7]: It adds a short circuit between the tokens except CLS of the lower layer and the higher layer of BERT [5] to enhance the final CLS representation.

[Table 1: Overall performance (r@100, r@500, and ndcg@50 on MA-Amazon and Alipay). The best results are in bold. † indicates significant differences between ATTEMPT and the best baselines in the first/second/third group.]

Implementation and Evaluation Details
We implemented ATTEMPT and all the baselines by ourselves. For all the methods, the encoder is shared for both queries and items.

Pre-training. The maximum token length is 156. The learning rate and number of epochs for the MA-Amazon/Alipay dataset are set to 1e-4/5e-5 and 20/10, respectively. We initialize the BERT parameters with Google's public checkpoint and use the Adam optimizer with a linear warm-up. For all -C baselines and ATTEMPT, the mask ratios are set to 0.15/0.3 for item/query content to account for the shorter query length. They all have the same mask ratio between aspect and content text except for BIBERT-C(A) and ATTEMPT, where the mask ratio for aspect text is 0.6. λ in Eq. 4 is set to 1.0. We fine-tune the pre-trained model checkpoints every two epochs and select the best one on the validation dataset.

Fine-tuning. On both datasets, all models are trained for 20 epochs with the Tevatron toolkit [8]. We use a learning rate of 5e-6 and a batch size of 64. All methods are trained with a softmax cross entropy loss with in-batch negatives and one hard negative. Note that we have not used the auxiliary classification objective for MTBERT and MADRAL since no significant improvements are achieved.

Metrics. We report recall@100, recall@500, and ndcg@50. When calculating ndcg on MA-Amazon, following [19], we set the gains of E, S, C, and I to 1.0, 0.1, 0.01, and 0.0, respectively. We perform two-tailed t-tests (p-value ≤ 0.05) to detect significant differences.
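For concreteness, the fine-tuning objective (softmax cross entropy over the positive item, in-batch negatives, and one hard negative) can be sketched as follows; the scores are toy dot-product values, not results from our experiments:

```python
import math

def in_batch_softmax_loss(scores, positive_idx):
    """Cross entropy of one query against all candidates in the batch:
    the positive item, in-batch negatives, and one hard negative."""
    m = max(scores)                          # subtract max for stability
    exp = [math.exp(s - m) for s in scores]
    return -math.log(exp[positive_idx] / sum(exp))

# Toy similarity scores: positive item (idx 0), two in-batch negatives,
# and one hard negative with a relatively high score (idx 3).
loss = in_batch_softmax_loss([5.0, 1.0, 0.5, 3.0], positive_idx=0)
print(round(loss, 4))
```

The loss shrinks as the positive's score pulls ahead of all negatives, which is what drives the query and relevant-item embeddings together.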

Main Results
The overall performance is shown in Table 1. We have the following observations: (1) Generally, methods using aspect information outperform those that don't, confirming the importance of aspects in relevance matching (e.g., MTBERT and MADRAL). (2) When the aspect text has a larger mask ratio than the content (in BIBERT-C(A)), the retrieval performance is boosted. This shows that the aspect text should be given special care to encourage sufficient learning. (3) When aspect text concatenation is incorporated into methods using aspect values for classification (MTBERT and MADRAL), the retrieval performance does not always become better. This could be because the input aspect text becomes a shortcut for the models to predict its corresponding class ID. When the pre-training data is large (e.g., on Alipay), such a relation is more likely to be grasped by the models, hindering the learning of beneficial interactions. (4) A more powerful pre-training method (Condenser) sometimes performs better than methods using aspects (MTBERT and MADRAL on Alipay). Note that the benefit from advanced pre-training techniques is orthogonal to the aspect information, and they can be combined for even better performance.
We leave the study of this combination to future work. (5) Overall, our ATTEMPT achieves the best performance on both datasets, showing the efficacy of its pre-training objective specifically designed for the concatenated text of aspect and content.

Further Analysis
We also probe ATTEMPT from various perspectives to verify its effectiveness. For reproducibility, our analysis is based on MA-Amazon. The only exception is the ablation study of query/item aspects, since only Alipay has both of them.
Ablation Study of Aspects. We study the effects of various aspects in ATTEMPT (brand, color, and category from level 1 to 4) in Table 2. We find that: (1) When using each aspect alone, only the category information enhances model performance. This might be because brand and color are often included in the item content already, while the category is extra meta information. The observation that category matters the most is consistent with [12]. (2) Combining all aspects outperforms using category only, indicating that brand and color may take better effect when interacting with the category. (3) More levels of category information lead to better performance, except that three and four levels have similar results. While adding more category levels provides richer information, the reduced coverage (see Section 4.1) might limit the benefits.

Ablation Study of Loss Function. We remove each of the three losses from the overall loss to see how important it is. In Table 2, we find that: (1) The bidirectional prediction losses are beneficial to ATTEMPT, and excluding either leads to a performance drop. (2) The aspect-to-content (a2c) prediction is the most helpful, indicating that using aspects as context for content MLM prediction is a feasible way to infuse the aspect information into an item. (3) The performance also drops a lot when the vanilla MLM loss is eliminated, indicating that the original content semantics, unaffected by external information, are also important.
Combination with Advanced Fine-tuning Techniques. AGREE [23] is a recently proposed fine-tuning method that incorporates a local aspect-query matching loss with the original global query-item matching loss. AGREE has not studied how to utilize query aspects, which suits MA-Amazon well since it does not have query aspects. Since AGREE concatenates the item aspects with the content, it is easy to integrate AGREE during fine-tuning after pre-training with ATTEMPT. The last block in Table 2 shows the performance of AGREE alone and of combining both. It shows that, on top of better fine-tuning techniques, ATTEMPT can achieve even better performance. Notably, combining AGREE with methods that conduct aspect classification does not necessarily lead to better performance (see MTBERT-C and MADRAL-C in Table 1).
Ablation Study of Query/Item Aspects. We examine the influence of the query and item aspects in Table 3. It shows that both query aspects and item aspects contribute to retrieval performance, and the item aspects are more important. Since we only use item aspects during relevance matching, query aspects only take effect during pre-training and thus contribute less.

CONCLUSION
In this paper, we propose an effective pre-training method that uses aspects as text strings and conducts mutual prediction between the aspect and content text for multi-aspect retrieval. In contrast to previous approaches that treat aspect values as categorical IDs, ATTEMPT can capture the semantic relations between aspect values through their text strings and perform finer-grained interactions between item aspect and content via mutual prediction. Our experiments on two real-world datasets show that ATTEMPT can outperform multiple competitive baselines significantly. Moreover, we release our enriched Multi-Aspect Amazon Product Search dataset to encourage research on multi-aspect dense retrieval.

Figure 1: The mutual prediction MLM in ATTEMPT. The aspect and content texts are colored green and purple.
(3) BIBERT-C: It only differs from BIBERT in the encoder input. It uses the aspect text in the same way as ATTEMPT during pre-training and fine-tuning. (4) BIBERT-C(A): It refines BIBERT-C by assigning a higher mask ratio specifically for the aspect text, which is consistent with ATTEMPT. (5) MTBERT [-C] [12]: It conducts k additional aspect classification tasks on the CLS token during pre-training. (6) MADRAL [-C] [12]: It introduces extra aspect embeddings, learns them by predicting the value classes of each aspect, and fuses them to produce the final item representation.

Table 2: Study of various component choices on MA-Amazon. † indicates significant improvements over BIBERT.

Table 3: Ablation study of query/item aspects on Alipay. † indicates significant improvements over BIBERT.