A Multi-Granularity-Aware Aspect Learning Model for Multi-Aspect Dense Retrieval

Dense retrieval methods have mostly focused on unstructured text, and less attention has been paid to structured data with various aspects, e.g., products with aspects such as category and brand. Recent work has proposed two approaches that incorporate aspect information into item representations for effective retrieval by predicting the values associated with the item aspects. Despite their efficacy, they treat the values as isolated classes (e.g., "Smart Homes", "Home, Garden & Tools", and "Beauty & Health") and ignore their fine-grained semantic relations. Furthermore, they either force the learning of aspects into the CLS token, which could divert it from its designated role of representing the entire content semantics, or learn extra aspect embeddings only with the value prediction objective, which could be insufficient, especially when there are no annotated values for an item aspect. Aware of these limitations, we propose a MUlti-granulaRity-aware Aspect Learning model (MURAL) for multi-aspect dense retrieval. It leverages aspect information across various granularities to capture both coarse and fine-grained semantic relations between values. Moreover, MURAL incorporates separate aspect embeddings as input to transformer encoders so that the masked language model objective can assist implicit aspect learning even without aspect-value annotations. Extensive experiments on two real-world datasets of products and mini-programs show that MURAL outperforms state-of-the-art baselines significantly.

In recent years, dense retrieval methods have been extensively studied in both the Information Retrieval (IR) and Natural Language Processing (NLP) communities [9]. Built on pre-trained language models (PLMs), they have achieved compelling performance. However, they are mostly studied for unstructured data and have not investigated how to effectively leverage the aspect information of structured data, such as category for products and affiliation for people. For example, in Figure 1, the query "sports gloves" targets gloves for sports use, so kitchen gloves should be avoided. The category of the four items could clearly help to differentiate various types of gloves and improve retrieval performance. Unfortunately, how to effectively leverage such aspect information in dense retrieval remains largely unexplored.
Recently, Kong et al. [13] have proposed two effective models for multi-aspect dense retrieval, i.e., MTBERT and MADRAL. These methods follow a typical paradigm of learning aspect embeddings with an auxiliary objective of predicting their associated values [2,3]. A concrete example is that the embedding of aspect "category" for i4 in Figure 1 will be learned by predicting its value, i.e., "Household Supplies". Although effective, they consider the values of an aspect as isolated classes and neglect the potential correlation between various values, which could result in sub-optimal performance. Taking the items in Figure 1 for instance, although they fall into four separate categories, the first three are relevant to the user query "sports gloves" while the last is not. The auxiliary objective of predicting their categorical IDs treats each category equally and may not capture their fine-grained relations.
Noticing this issue, we propose to leverage the aspect information at even finer granularities, such as the word and token levels, in addition to the previously considered phrase-level granularity. Then, for the items in Figure 1, when we break their category phrases into small pieces, the relation between the first three will be clearer since they all have sports-related descriptions such as exercise, sport, tennis, etc. Moreover, from a linguistic perspective, coarser granularities such as sentences and phrases convey more specific information while finer units usually carry more general information [22]. Since different granularities could express various levels of intent, we incorporate multiple granularities of aspect annotation prediction to assist query/item representation learning.
Our model is named MURAL, short for a MUlti-granulaRity-aware Aspect Learning model. It incorporates separate aspect embeddings before the content tokens and after CLS as inputs to the transformer layers (shown in Figure 3). Then, on the top layer, the aspect embeddings are supervised with value predictions at various levels of granularity (e.g., phrase, word, and token). MURAL has several advantages over state-of-the-art methods, i.e., MTBERT and MADRAL (see Figure 2). First, in contrast to MTBERT, which mixes the information from item aspects and the overall content semantics in CLS, MURAL represents the two types of information separately and allows for more interactions between them with a gating mechanism. Second, in contrast to MADRAL, which only learns the aspect embeddings with the value prediction objective during pre-training, MURAL also guides the aspect embeddings to learn from the masked language model loss. This could assist implicit aspect learning even when there are no annotated values for an item aspect. Last and most importantly, by incorporating the aspect information across various granularities, MURAL could capture the semantic relations between the aspect values at different levels, contributing more to the retrieval performance.
We conduct extensive experiments on two real-world search datasets with rich aspect information. Experimental results show that our method outperforms competitive baselines significantly on both datasets. Remarkably, our model achieves compelling performance even without the supervision of aspect annotations, which means that useful implicit representations can be learned by MURAL even when the aspect information is not used. Ablation studies on different granularities show that each granularity contributes to the multi-aspect retrieval performance and that combining them all leads to much better results.

RELATED WORK
Dense Retrieval. Dense retrieval models typically use a bi-encoder structure for independent query and item encoding, with relevance measured through a simple similarity function (such as dot product). Karpukhin et al. [10] initialized the encoder with BERT and combined it with in-batch negatives, achieving better performance than early models. After that, researchers began to explore various fine-tuning techniques to train a better dense retriever, including hard negative mining [23,31], knowledge distillation [28], and multi-vector representation [11,19,33]. For example, Xiong et al. [31] proposed to dynamically mine hard negatives during training by periodically refreshing the index. Luan et al. [19] captured information of items from different perspectives by using the first k document token embeddings as the item representation. Based on this, Zhang et al. [33] added k special tokens before the item input to obtain the multi-vector representation. These multi-vector methods aim to extract multiple kinds of underlying semantic information from the item. In contrast, our method explicitly models multi-aspect information. Additionally, our method outputs only a single representation vector for each item, saving space and time for indexing items.
Recently, Kong et al. [13] introduced two methods for incorporating explicit aspect information into a single representation vector. The first method employs CLS embeddings to simultaneously perform aspect classification tasks for multiple aspects. The second method adds an attention network to the PLM, enabling it to separately model multiple aspects, followed by aspect fusion. Their differences with our method will be introduced in Section 4.
Multi-Field Retrieval. The effective utilization of multi-field information (e.g., title, keyword, description) in documents has long been studied. Before PLMs emerged, many neural ranking models were proposed to effectively leverage item structure [4,17,32]. For example, Zamani et al. [32] aggregated field-level representations to obtain item representations and employed a matching network for final relevance score prediction. In the PLM era, research has continued to focus on the utilization of multi-field information [26,27]. For example, Shan et al. [26] proposed the field-level local matching loss, calculated based on the query and each document field representation. Sun et al. [27] treated aspects as text and proposed an effective pre-training method to capture the bi-directional interactions between aspect and content texts. The difference between multi-aspect and multi-field is that fields contain an infinite textual value space, usually composed of variable-length unstructured text. Conversely, an aspect has a defined set of finite values, acting as "labels" for structured items. Given this, they face different core challenges, and effectively utilizing multiple aspects' information is a valuable research direction.
Pre-trained Bi-encoder. Researchers have explored pre-training models for retrieval with the bi-encoder architecture [6,7,14,18,20,30]. For example, Gao and Callan [6] added extra head layers atop the Transformer, with shortcut connections between early outputs and the head, enhancing the CLS embedding of the encoder. Lu et al. [18] pre-trained an auto-encoder with a weak decoder for document representation learning. Differing from these pre-training methods targeted at unstructured data, we investigate how to infuse explicit aspect information into the encoder representation during pre-training. In the future, we will explore how to integrate our approach with existing research.

PRELIMINARIES
Dual Encoding. Standard PLMs, e.g., BERT [5], take a token sequence $T = (t_1, \ldots, t_n)$ as input and generate contextualized representations as:
$$[h(t_0), h(t_1), \ldots, h(t_n)] = \mathrm{PLM}(t_0, t_1, \ldots, t_n), \quad h(t_k) \in \mathbb{R}^{H}, \tag{1}$$
where $H$ denotes the hidden size, and $t_0 = [\mathrm{CLS}]$ is a special token added to the beginning. The representation $h(t_0)$ is commonly used as the final representation for the input $T$. In dense retrieval, the bi-encoder architecture is widely adopted, where the query $q$ and item $i$ are separately encoded using the PLM to obtain their respective representation vectors [16]. Then, a simple scoring function is used to calculate the similarity between these two vectors.
Aspect Learning. In dense retrieval, aspect learning involves using aspect information to enhance retrieval performance when queries or items are associated with varying aspects (e.g., brand, color, category in product search). In addition to the content text $T = (t_1, \ldots, t_n)$ (e.g., query, item title), a query or item can be associated with multiple aspects, and we denote the set of these aspects as $A_T = \{a_1, \ldots, a_{|A_T|}\}$. For simplicity, when the context is clear, we omit the subscript $T$ in $A_T$. For each aspect $a$, there exists a finite vocabulary of aspect values, represented as $V_a$, along with a corresponding embedding table $E_a \in \mathbb{R}^{|V_a| \times H}$ that contains embeddings for each value of aspect $a$. Figure 2 shows the aspect learning in two state-of-the-art multi-aspect dense retrievers [13]. Both approaches utilize content text as the encoder input. Specifically, MTBERT reuses CLS to represent aspects, whereas MADRAL constructs embeddings for the $|A|$ aspects by attending to the final layer of content tokens. Both methods train the aspect embeddings by predicting the corresponding value annotation ID in $V_a$ for each of the $|A|$ aspects.
[Figure 2: Value annotation prediction for aspect learning in MTBERT and MADRAL. Recoverable labels: attention; the aspect value embedding tables (brand, category, color).]
Multi-Granularity. Different granularities of text strings capture semantic information at varied levels. Coarse grains such as sentences or phrases often express more specific intent than finer grains like words or tokens. Therefore, relying solely on phrase-level value prediction, as previous methods do, might not yield effective aspect representations. For example, if a product category value is "handmade products", its word-level granularity values would be "[handmade, products]", and its token-level granularity values would be "[hand, ##made, products]". Formally, we denote the set of granularities as $G$, where each $g$ (with $g \in G$) represents a specific granularity. In this paper, we use three granularities: $G = \{\mathit{phrase}, \mathit{word}, \mathit{token}\}$. We use $V_a^g$ to represent the value vocabulary obtained by decomposing aspect $a$'s values at granularity $g$. The corresponding aspect value embedding table becomes $E_a^g \in \mathbb{R}^{|V_a^g| \times H}$. We list the frequently used notations in Table 1.

METHODOLOGY
In this section, we propose a MUlti-granulaRity-aware Aspect Learning model (MURAL) for multi-aspect dense retrieval and introduce its core components. As in [13], MURAL is based on BERT [5]. Since MURAL encodes both items and queries in the same way, we only use items for illustration.

Aspect Representations
It is crucial to represent aspects reasonably in a pre-trained model so that aspect learning can guide their training effectively. To fully exploit the capabilities of the Transformer encoders, as shown in Figure 3, we introduce several tokens after CLS and before the content tokens to represent aspects from various perspectives. This aligns with the way CLS is obtained, and these tokens can interact with the content tokens sufficiently. During pre-training, these inserted embeddings can act as different views of context when predicting the masked tokens in the content. In this way, these embeddings can also learn from the masked language model objective and capture the content semantics from various implicit views, which could bring more benefits, especially when there are no value annotations for an aspect.
Comparison with Previous Methods. As shown in Figure 2, MTBERT [13] reuses the CLS token to predict the values of item aspects, which forces CLS to mix the information from item aspects with the overall content semantics it was originally designated to capture. The balance between the two cannot be automatically learned, and CLS could be confused about what it should learn. MADRAL [13] represents each item aspect separately by attending to the final representations of content tokens and learns the aspect embeddings by predicting their associated values. During pre-training, the only guidance for the aspect embeddings is this value prediction objective, which could be insufficient to learn them well. Worse, they will not be updated when there are no aspect-value annotations. In contrast, in MURAL, the aspect information does not mix with the overall content semantics in CLS. With the gating mechanism, they can interact more and the aspect importance can be learned automatically. Moreover, the aspect embeddings can be guided by the masked language model loss as well, which not only benefits their representation learning but also facilitates implicit aspect learning without aspect-value annotations.
Figure 3: Our MURAL with single-objective-based grouping in a simplistic scenario of two aspects and two granularities.

Aspect Learning
For simplicity, we use an example of aspect $a$ at granularity $g$ to illustrate. For aspect learning in MURAL, there are two important components: value representation and the aspect learning objective.
Value Representation. To conduct effective aspect learning by predicting the value annotations of an aspect in terms of both coarse and fine grains, value representations play an important role. There are two options: 1) Sharing the existing token embeddings in the backbone PLM and calculating the value embeddings of word-level and phrase-level grains by a projection function. This reuses the semantic information carried in the PLM tokens. However, the token embeddings are then learned towards the goals of both the PLM and aspect learning, which may interfere with each other. 2) Declaring separate value embedding tables, which is consistent with [13] and with the research before the PLM era in [3]. The extra value embeddings let the model conduct aspect learning without interference from other objectives. However, if trained from scratch, these new parameters may be difficult to optimize. We refer to these two options as "shared" and "unshared" respectively, in terms of whether the underlying token embeddings are shared with the existing encoder. We investigate both ways in our experiments (see Section 6.2).
Specifically, for the "shared" option, we first tokenize each aspect value $v$ in $V_a^g$ using the BERT tokenizer. Then, we extract the token embeddings from BERT's embedding table and apply a projection function to them to obtain the corresponding value embedding $e_v$ during training. In this paper, we adopt average pooling as the projection function since it is simple and produces similar results to other methods in our preliminary experiments.
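The average-pooling projection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional toy table `TOKEN_EMB` and the function name `value_embedding` are hypothetical stand-ins for BERT's token embedding table and the projection function.

```python
from typing import Dict, List

# Hypothetical 3-dimensional token embeddings standing in for BERT's table.
TOKEN_EMB: Dict[str, List[float]] = {
    "hand": [1.0, 0.0, 2.0],
    "##made": [3.0, 2.0, 0.0],
    "products": [2.0, 4.0, 1.0],
}


def value_embedding(tokens: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average-pool the token embeddings of one aspect value."""
    if not tokens:
        raise ValueError("an aspect value must contain at least one token")
    dim = len(table[tokens[0]])
    pooled = [0.0] * dim
    for tok in tokens:
        for d, x in enumerate(table[tok]):
            pooled[d] += x
    return [x / len(tokens) for x in pooled]


# Phrase-level value "handmade products" and word-level value "handmade".
phrase_emb = value_embedding(["hand", "##made", "products"], TOKEN_EMB)
word_emb = value_embedding(["hand", "##made"], TOKEN_EMB)
```

Under the "shared" option these pooled vectors would be recomputed from the live token embeddings during training; under the "unshared" option the same vectors would only serve as the initial state of a separate table.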
For the "unshared" option, each aspect $a$ has a separate embedding table $E_a^g$ for each granularity $g$, storing the embeddings of its values $V_a^g$. Instead of training these tables from scratch, we initialize them with the average PLM token embeddings of each value $v$ at granularity $g$ (the same as the initial embedding in the "shared" option). This gives the new embeddings a decent starting point in the semantic space while leaving them free to specialize for aspect learning, the benefit of which will be shown in Section 6.2.
Aspect Learning Objective. Once we obtain the representation of aspect $a$ at granularity $g$, denoted as $h_a^g(T) \in \mathbb{R}^{H}$, we adopt the widely-used group-wise contrastive loss to pre-train the encoder. It aims to bring the source representation closer to instances in its target group while distancing it from representations of other groups [12]:
$$\mathcal{L}_a^g(T) = -\frac{1}{|\mathcal{A}_a^g|} \sum_{e^+ \in \mathcal{A}_a^g} \log \frac{\exp\big(f(h_a^g(T), e^+)\big)}{\sum_{e \in E_a^g} \exp\big(f(h_a^g(T), e)\big)}, \tag{2}$$
where $e, e^+ \in \mathbb{R}^{H}$ are aspect value embeddings from $E_a^g$, $f(\cdot)$ is the dot-product function, and $\mathcal{A}_a^g$ is the set of aspect value annotations for aspect $a$ at granularity $g$.
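To make the objective concrete, here is a minimal pure-Python sketch of a group-wise contrastive loss for one aspect at one granularity. It assumes a softmax over all value embeddings in the table, averaged over the annotated (positive) values; this normalization, the function name `aspect_loss`, and the choice to return zero when an item has no annotation are assumptions for illustration rather than details confirmed by the text.

```python
import math
from typing import List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product scoring function f."""
    return sum(x * y for x, y in zip(u, v))


def aspect_loss(h: Sequence[float],
                value_table: List[Sequence[float]],
                positives: Set[int]) -> float:
    """Average negative log-softmax score of the annotated values.

    h           -- aspect representation of the item at one granularity
    value_table -- one embedding per value in the aspect vocabulary
    positives   -- indices of the annotated values for this item
    """
    if not positives:
        # Assumption: an item without annotations contributes no aspect loss.
        return 0.0
    scores = [dot(h, e) for e in value_table]
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return -sum(scores[p] - log_z for p in positives) / len(positives)


table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
uninformed = aspect_loss([0.0, 0.0], table, {0})  # all values score equally
aligned = aspect_loss([5.0, 0.0], table, {0})     # h points at the positive
```

With a zero representation every value is equally likely, so the loss equals the log of the vocabulary size; moving the representation toward the annotated value lowers it.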

Multi-Granularity-Aspect Grouping
Assume there are $|A|$ aspects and $|G|$ granularities. Our goal is to facilitate aspect learning for each aspect at each granularity, for a total of $|A| \times |G|$ learning objectives. A straightforward approach is to use a single representation to handle these multiple objectives. However, this forces all the information to be compressed together, severely limiting the learning capacity of each objective. Therefore, we introduce three grouping schemes to integrate multiple granularities and multiple aspects: Single-objective-based Grouping, Granularity-based Grouping, and Aspect-based Grouping.
Single-objective-based Grouping. As shown in Figure 3, when there are only a few aspects and granularities, we can directly introduce $|A| \times |G|$ tokens in the input sequence $T$ to capture the item semantics from $|A| \times |G|$ views. Each of them accounts for a single objective among the multi-granularity-aspect combinations. Specifically, we obtain a sequence $T = (t_0, m_1, \ldots, m_{|A| \times |G|}, t_1, \ldots, t_n)$. We utilize the hidden vector $h(m_k)$ ($k = 1, \ldots, |A| \times |G|$) from the final layer as the item representation from the perspective of aspect $a_i$ at granularity $g_j$. The aspect learning loss function becomes:
$$\mathcal{L}_A = \sum_{i=1}^{|A|} \sum_{j=1}^{|G|} \mathcal{L}_{a_i}^{g_j}(T). \tag{3}$$
When $|A| \times |G|$ is large, adding a significant number of tokens can adversely affect the semantic representation of the original input. Hence, it becomes essential to further group the objectives across various granularities and aspects.
Granularity-based Grouping. The same granularity indicates the same level of semantic information, so grouping the objectives at the same grain is a reasonable option. In this case, $|G|$ tokens will be inserted into the input sequence, yielding $T = (t_0, m_1, \ldots, m_{|G|}, t_1, \ldots, t_n)$.
Aspect-based Grouping. An alternative option is to group the objectives across multi-granularity-aspects by aspects, so that different aspect information will not mix together and various levels of granularity can benefit each other. Here, we introduce $|A|$ guiding tokens before the content tokens: $T = (t_0, m_1, \ldots, m_{|A|}, t_1, t_2, \ldots, t_n)$. The hidden vector $h(m_i)$ ($i = 1, \ldots, |A|$) captures the representations of all granularities for the input item corresponding to aspect $a_i$. In particular, when calculating the loss $\mathcal{L}_{a_i}^{g_j}(T)$ using Equation 2, the representation $h_{a_i}^{g_j}(T) = h(m_i)$ of aspect $a_i$ remains the same across different granularities. Under this aggregation method, the loss $\mathcal{L}_A$ can be reformulated as follows:
$$\mathcal{L}_A = \sum_{i=1}^{|A|} \sum_{j=1}^{|G|} \mathcal{L}_{a_i}^{g_j}(T), \quad \text{with } h_{a_i}^{g_j}(T) = h(m_i) \text{ for all } g_j \in G. \tag{5}$$
Grouping by granularities or aspects reduces the number of guiding tokens, accommodating scenarios with numerous aspects and granularities. Their model architectures stay the same as in Figure 3, except that aspect learning objectives for the same granularities or aspects are conducted on the shared token.
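The three grouping schemes differ only in which guiding token each (aspect, granularity) objective is attached to. The sketch below illustrates that bookkeeping; the function names, the scheme labels, and the aspect-major ordering of tokens in the single-objective scheme are assumptions made for the example, since the text does not fix an ordering.

```python
from typing import Dict, List, Tuple


def guiding_token_map(aspects: List[str],
                      granularities: List[str],
                      scheme: str) -> Dict[Tuple[str, str], int]:
    """Map each (aspect, granularity) objective to a guiding-token position.

    Position 0 is reserved for CLS, so guiding tokens start at 1.
    """
    mapping: Dict[Tuple[str, str], int] = {}
    for i, aspect in enumerate(aspects):
        for j, gran in enumerate(granularities):
            if scheme == "single":         # one token per objective
                position = i * len(granularities) + j + 1
            elif scheme == "granularity":  # one token per granularity
                position = j + 1
            elif scheme == "aspect":       # one token per aspect
                position = i + 1
            else:
                raise ValueError(f"unknown grouping scheme: {scheme}")
            mapping[(aspect, gran)] = position
    return mapping


def num_guiding_tokens(mapping: Dict[Tuple[str, str], int]) -> int:
    """Number of distinct guiding tokens inserted after CLS."""
    return len(set(mapping.values()))


ASPECTS = ["brand", "color", "category"]
GRANULARITIES = ["phrase", "word", "token"]
single = guiding_token_map(ASPECTS, GRANULARITIES, "single")
by_granularity = guiding_token_map(ASPECTS, GRANULARITIES, "granularity")
by_aspect = guiding_token_map(ASPECTS, GRANULARITIES, "aspect")
```

With three aspects and three granularities, the single-objective scheme inserts nine tokens while either grouped scheme inserts three, and in every case all nine objectives are still trained.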

Aspect Embedding Fusion
For efficiency, it is necessary to consolidate multiple embeddings into a single one to minimize storage and computation costs. Inspired by [13], we adopt the "CLS-Gating" fusion mechanism in MURAL. To illustrate the fusion process, we present an example using the single-objective-based grouping approach discussed in Section 4.3. Specifically, we pass the CLS embedding through a linear layer and a softmax function to compute the weighting scores for $h(m_1), \ldots, h(m_K)$, where $K = |A| \times |G|$:
$$w = \mathrm{softmax}\big(h(t_0)\,W + b\big), \tag{6}$$
where $W \in \mathbb{R}^{H \times K}$ and $b \in \mathbb{R}^{K}$ are trainable parameters. Then, we utilize the learned weights to fuse the multiple embeddings, thereby obtaining the final encoded representation of the input $T$.
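A minimal sketch of CLS-gated fusion is given below. It assumes that the fused vector is a plain weighted sum of the guiding-token embeddings, with weights produced from the CLS embedding by a linear layer and a softmax; whether the CLS embedding itself also enters the sum is not recoverable from the text here, so treat that choice, and the names `cls_gating`, `W`, and `b`, as assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def cls_gating(h_cls: Sequence[float],
               guide_embs: List[Sequence[float]],
               W: List[Sequence[float]],
               b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (weights, fused vector).

    h_cls      -- CLS embedding, length H
    guide_embs -- K guiding-token embeddings, each of length H
    W, b       -- gating parameters of shape H x K and K
    """
    num_guides = len(guide_embs)
    logits = [
        sum(h_cls[d] * W[d][k] for d in range(len(h_cls))) + b[k]
        for k in range(num_guides)
    ]
    weights = softmax(logits)
    dim = len(guide_embs[0])
    fused = [
        sum(weights[k] * guide_embs[k][d] for k in range(num_guides))
        for d in range(dim)
    ]
    return weights, fused


# With zero gating parameters every guiding token gets the same weight.
weights, fused = cls_gating(
    h_cls=[1.0, 2.0],
    guide_embs=[[2.0, 0.0], [0.0, 2.0]],
    W=[[0.0, 0.0], [0.0, 0.0]],
    b=[0.0, 0.0],
)
```

Because the gate depends on the CLS embedding, the importance of each guiding token can vary per input rather than being a fixed global weight.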

Training Objectives
Pre-training. As discussed in previous work [21], the Masked Language Model (MLM) [29] task could help construct good text representations for IR. Therefore, similar to [13], we also adopt MLM as one of the pre-training objectives besides aspect learning:
$$\mathcal{L}_{MLM} = -\sum_{t \in m(T)} \log p\big(t \mid T_{\setminus m(T)}\big), \tag{8}$$
where $T$ is the input sentence, and $m(T)$ and $T_{\setminus m(T)}$ denote the masked tokens and the remaining tokens of $T$, respectively.
We then pre-train the Transformer encoder using the aspect learning loss jointly with the MLM loss:
$$\mathcal{L} = \mathcal{L}_{MLM} + \lambda \mathcal{L}_A, \tag{9}$$
where $\lambda$ is a hyperparameter.
Fine-Tuning. We adopt the in-batch softmax cross entropy loss $\mathcal{L}_R$ (Equation 10) as the learning objective during fine-tuning. Note that although the aspect learning loss could also be added during fine-tuning, our experimental results show no significant improvements for any of the multi-aspect retrievers. Hence, we omit this objective in this paper.
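The two training objectives above can be sketched in a few lines. The in-batch loss below is the standard softmax cross entropy over dot-product scores, in which each query's positive item competes against every other item in the batch; it is a generic illustration, and the function names and the convention that any hard negatives are appended after the positives are assumptions, not the paper's exact formulation.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def in_batch_softmax_ce(queries: List[Sequence[float]],
                        items: List[Sequence[float]]) -> float:
    """Mean softmax cross entropy over the batch.

    items[k] is the positive for queries[k]; every other row of items
    (other queries' positives and any appended hard negatives) is a negative.
    """
    total = 0.0
    for k, query in enumerate(queries):
        scores = [dot(query, item) for item in items]
        top = max(scores)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[k]
    return total / len(queries)


def pretraining_loss(mlm_loss: float, aspect_loss: float, lam: float = 0.1) -> float:
    """Joint pre-training objective: MLM loss plus weighted aspect loss."""
    return mlm_loss + lam * aspect_loss


batch_loss = in_batch_softmax_ce(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    items=[[1.0, 0.0], [0.0, 1.0]],
)
```

Setting `lam` to 0 removes the aspect term entirely, which corresponds to the MUR variant described in the experiments.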

EXPERIMENTAL SETTINGS

Datasets
We use the following two large-scale search datasets from real-world platforms with rich aspect information for our experiments. The aspect-related statistics of the two datasets are in Table 2.
(1) Multi-Aspect Amazon ESCI Dataset (MA-Amazon). MA-Amazon [27] enriches the English portion of the Amazon ESCI [24] dataset with item category information. In MA-Amazon, only items have annotations for brand, color, and category. The item corpus contains 482K distinct products, which are used for pre-training. The retrieval dataset has 17k, 3.5k, and 8.9k queries for fine-tuning, validation, and testing respectively, without any query overlaps. For each query, the retrieval dataset provides 20.1 items on average, along with their ESCI relevance judgments (Exact, Substitute, Complement, Irrelevant), indicating each item's relevance to the given query. Following [24], we treat Exact as positive and all others as negatives for fine-tuning and metrics requiring binary labels.
(2) Alipay Search Dataset. Alipay is a Chinese mini-program (app-like service) search dataset. In Alipay, both queries and items are annotated with two aspects: brand and category. We conduct pre-training on both a query corpus, containing 1.3M unique queries, and an item corpus with 1.8M distinct items. The retrieval dataset contains 60k, 3.3k, and 3.3k real user queries for fine-tuning, validation, and testing respectively, without query overlaps. Note that

Baselines
We adopt the following dense retrieval baselines for comparison, including models with and without aspect information: (1) BIBERT [15,25]: A standard bi-encoder baseline and the backbone of MURAL, using the CLS encoding of the BERT-based encoder for both query and item representations. BIBERT is pre-trained with MLM loss and fine-tuned with loss $\mathcal{L}_R$ (Equation 10). (2)
MURAL is our proposed multi-aspect dense retrieval model. In contrast, MUR disables aspect learning. Specifically, when $\lambda$ in Equation 9 is set to 0, MURAL reduces to MUR. MURAL-CONCAT employs the same aspect-content text concatenation strategy as BIBERT-CONCAT for the model input. Note that unless the model name includes "-CONCAT", the model input consists solely of content text.

Implementation Details
We implemented MURAL and all the baselines by ourselves to ensure consistent implementation details and fair comparisons.

Pre-training.
For all methods, the encoder is shared for both queries and items to facilitate knowledge sharing. Specifically, we pre-train on a corpus consisting of the item corpus or a mixture of the query and item corpus (when query aspect annotations are available) to obtain the shared encoder for fine-tuning.
Multi-granularity Value Collection. Given an aspect $a$ and its original aspect vocabulary at the phrase level, we obtain its word and token granularity vocabularies as follows. For word granularity, we segment each aspect value $v$ by spaces and punctuation (for English) or employ the Jieba tool [1] (for Chinese), and eliminate duplicates to aggregate the generated "words". For token granularity, we merge the token lists obtained by processing each aspect value $v$ with the BERT tokenizer to create the corresponding vocabulary set.
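The collection procedure can be sketched for English values as below. This is an illustration under stated assumptions: `TOY_VOCAB` is a hypothetical six-entry WordPiece vocabulary, `wordpiece` is a simplified greedy longest-match split standing in for the real BERT tokenizer, values are lowercased, and punctuation is discarded rather than kept as tokens.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical WordPiece vocabulary; a real run would use the BERT tokenizer.
TOY_VOCAB: Set[str] = {"hand", "##made", "products", "home", "garden", "tools"}


def split_words(value: str) -> List[str]:
    """Segment an English aspect value by spaces and punctuation (lowercased)."""
    return re.findall(r"[a-z0-9]+", value.lower())


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces


def build_vocabularies(values: Iterable[str], vocab: Set[str]) -> Dict[str, List[str]]:
    """Deduplicated phrase-, word-, and token-level vocabularies for one aspect."""
    phrases = sorted(set(values))
    words = sorted({w for v in phrases for w in split_words(v)})
    tokens = sorted({t for w in words for t in wordpiece(w, vocab)})
    return {"phrase": phrases, "word": words, "token": tokens}


vocabs = build_vocabularies(
    ["Handmade Products", "Home, Garden & Tools", "Handmade Products"], TOY_VOCAB
)
```

Each granularity then gets its own value vocabulary, so an item annotated with one phrase-level value typically has several word-level and token-level annotations.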
Model Pre-training. We initialize all the BERT components using Google's public checkpoint and employ the Adam optimizer with the linear warm-up technique. The learning rate and number of epochs for the MA-Amazon/Alipay dataset are set to 1e-4/5e-5 and 20/10, respectively. The maximum token length is 156, and the MLM mask ratios are 0.15 for items and 0.3 for queries. For all methods requiring adjustment of training objective scaling coefficients, we uniformly select coefficients based on their validation set performance after fine-tuning. These coefficients vary from 0.1 to 1, in 0.1 intervals. For our method, we set $\lambda$ in Eq. 9 to 0.1. We evaluate pre-trained model checkpoints every two epochs using the fine-tuning procedure described below and select the best one on the validation set.

Fine-tuning.
For both datasets, we fine-tune all the models for 20 epochs with the Tevatron toolkit [8]. Following the previous work [10], we include a hard negative sample for each query besides in-batch negatives. We use a learning rate of 5e-6 and a batch size of 64. The maximum token lengths are set to 32 for queries and 156 for items. All the models are trained with the relevance loss $\mathcal{L}_R$ (Eq. 10).

Overall Performance
We compare MURAL with baseline models both with and without aspect information. As shown in Table 3, we have the following observations: (1) The models that leverage the aspect information (methods except for BIBERT, Condenser, and MUR) outperform their backbone (BERT), which does not use it. Among the multi-aspect dense retrievers, MURAL performs the best by a significantly large margin. This confirms the necessity of incorporating aspects in query/item representation learning. (2) On MA-Amazon, MADRAL underperforms the simpler MTBERT. We believe this is due to the smaller amount of pre-training data of MA-Amazon and the absence
Notably, the gains from advanced pre-training techniques are orthogonal to our method. Our approach can be easily incorporated into stronger backbones like Condenser and could achieve even better performance. We leave this to future work. (4) BIBERT-CONCAT performs better than MTBERT and MADRAL in terms of some metrics on the two datasets, indicating that concatenating aspects as text strings can be beneficial. However, query aspects need special care during relevance matching in order to achieve good performance. Incorporating both concatenation and aspect prediction in the same model (MURAL-CONCAT) does not always result in better performance than without concatenation. We have similar observations with MTBERT and MADRAL but, due to space constraints, we do not report them. The reason may be that the model learns unwanted shortcuts when using the aspect both as the model input and the learning objective. (5) Our method shows competitive performance even without using aspect annotations (MUR). MUR outperforms most baseline models in terms of all the metrics except NDCG@50 on Alipay. This indicates that MUR can capture complementary information from implicit perspectives for the final representation. It also confirms the advantages of aspect representations and the MLM training for the aspects in MURAL.

Studies on Model Variants
In this subsection, we study various options for the essential components in MURAL. For reproducibility, all experiments are conducted on the public MA-Amazon dataset.
Studies on Value Representations. In Table 4, we observe that using an independent value embedding space (the "unshared" option in Section 4.2) leads to better performance. As we mentioned earlier, the "shared" option optimizes token embeddings towards both the objectives in BERT and the aspect learning, which could interfere with each other and limit the capacity of the model on aspect prediction. However, under the "unshared" option, it may be difficult to optimize the separate value embeddings from scratch while other parameters only need fine-tuning. To see whether this affects model performance, instead of using the same initial state as the "shared" option, we randomly initialize the embeddings while
(1) For small numbers of aspects and granularities, simply use independent learning for each objective. (2) When there are more aspects and granularities, grouping multiple objectives in one guiding token can be a better choice.
Studies on Guiding Tokens and Fusion Methods. Instead of adding separate guiding tokens for aspect learning, we study a variant that reuses the same number of tokens at the beginning of the input sequence to conduct aspect learning. The results show that this variant performs on par with or better than the best baseline in Table 3 but is significantly worse than the best variant of MURAL. This indicates that multi-granularity-aware aspect learning is beneficial, but that separate guiding tokens are needed to conduct the learning.
To study whether CLS-Gating (introduced in Section 4.4) is helpful for aspect embedding fusion, we study another variant that removes it and uses the CLS embedding as the final representation. CLS naturally fuses the aspect embeddings in the second-to-last layer while the aspect learning is conducted in the last layer. This variant performs better than the best baseline in Table 3 but is worse than the best variant. It indicates that the fusion should be carried out on the final aspect embeddings with a proper weighting mechanism.

Ablation Studies
We ablate various components of multi-aspect, multi-granularity, and query/item aspect learning. In this section, our experiments are also based on the enriched MA-Amazon dataset. Additionally, we validate the effects of query-side and item-side aspect learning on the Alipay dataset, since MA-Amazon lacks query-side aspect information.
Effect of Aspects and Granularities. In Table 5, we first study the impact of multi-aspect and multi-granularity in MURAL. We find that: (1) Every aspect contributes to the model performance, especially category, consistent with [13]. (2) Each granularity alone
Effect of Query/Item Aspects. We disable the aspect learning on the query/item side in Table 6. We observe that both query and item aspects are beneficial and that query aspects have a larger impact, which is also consistent with [13]. This is not surprising since query aspects are obtained from query analysis such as intent classification and carry more additional information.

Aspect Learning Accuracy
In Table 7, we compare the aspect prediction accuracy of MURAL with that of the baseline methods after pre-training and after fine-tuning to better understand the aspect learning process. We analyze only the most important aspect, category, on Alipay; MA-Amazon and the other aspects lead to similar conclusions. Since each item may have multiple category annotations, we report Accuracy@3. Query and item aspect prediction are evaluated on the test query set and the item corpus of the Alipay dataset, respectively. First, all methods have high accuracy after aspect learning in pre-training and lower accuracy after fine-tuning. This drop is expected, since we use only the relevance loss during fine-tuning. In our experiments, we find that adding the aspect learning loss during fine-tuning improves aspect prediction accuracy but harms retrieval performance. We speculate that this objective guides the model parameters to a region that is not aligned with the relevance-matching objective. Hence, higher aspect prediction accuracy does not always co-occur with better retrieval performance.
Secondly, the prediction accuracy of MTBERT drops dramatically after fine-tuning. Since MTBERT uses the same CLS token for both relevance matching and aspect prediction, optimizing only toward relevance matching during fine-tuning undermines its ability to predict aspect values. In contrast, MADRAL and MURAL retain most of this ability after fine-tuning, since they use extra aspect embeddings to perform aspect learning.
Lastly, for phrase-level evaluation, MURAL has the best aspect prediction accuracy. MURAL also has the best retrieval performance, which suggests that it learns both objectives well and that better aspect embeddings assist relevance matching more. Notably, the accuracy at the word and token levels is not comparable with that at the phrase level, since the ground truth is different. The finer-level prediction accuracy is also good. As the granularity becomes finer, the accuracy after fine-tuning becomes lower, probably because finer granularities have more ground-truth values, which makes the multi-label classification more challenging.
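The Accuracy@3 metric used in this section can be illustrated with a minimal sketch. The text does not spell out the exact definition, so the sketch assumes that a prediction counts as correct when at least one of the top-k scored values is among the annotated values, and that examples without annotations are skipped. The function name `accuracy_at_k` and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def accuracy_at_k(
    scores: List[Dict[str, float]],
    gold: List[Set[str]],
    k: int = 3,
) -> float:
    """Fraction of annotated examples whose top-k predicted values
    contain at least one ground-truth value (assumed definition)."""
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    hits = 0
    evaluated = 0
    for example_scores, gold_values in zip(scores, gold):
        if not gold_values:
            # Assumption: examples without annotated values are not evaluated.
            continue
        evaluated += 1
        # Rank candidate aspect values by predicted score, highest first.
        top_k = sorted(example_scores, key=example_scores.get, reverse=True)[:k]
        if gold_values.intersection(top_k):
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Toy example: two items, four candidate category values each.
toy_scores = [
    {"a": 0.9, "b": 0.5, "c": 0.3, "d": 0.1},
    {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.9},
]
toy_gold = [{"c"}, {"a"}]
print(accuracy_at_k(toy_scores, toy_gold, k=3))
```

Under this definition, a finer granularity enlarges both the candidate vocabulary and the set of ground-truth values per item, so the scores at different granularities are computed against different targets and should not be compared directly.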

Case Visualization
We visualize the item representations of three categories, as shown in Figure 1, to see their distributions in the semantic space. Specifically, c1 (Sport Specific Clothing) and c2 (Exercise & Fitness) are semantically similar, while c3 (Household Supplies) is unrelated to the first two. We first use MADRAL and MURAL to obtain all the item representations on MA-Amazon and group the items by category. Then we randomly pick 20 items from c1, c2, and c3 and plot them using the t-SNE toolkit in Figure 4. We observe that MADRAL separates c1, c2, and c3 to a similar extent. By contrast, MURAL places the related categories c1 and c2 closer together and farther from the unrelated c3. This indicates that MADRAL cannot discern the semantic similarity between c1 and c2, because it treats different phrase-level product categories as isolated IDs and overlooks their word-level semantic connections. By incorporating both coarse-grained and fine-grained information, MURAL captures fine-grained semantic relations among similar aspect values while maintaining precise phrase-level aspect discrimination.
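The qualitative pattern in the plot could also be checked numerically. The following sketch is not part of the paper's analysis: it assumes item representations are available as plain lists of floats grouped by category, and it compares cosine distances between category centroids. The helper names and the toy two-dimensional vectors are hypothetical and only illustrate the expected ordering (c1 and c2 closer to each other than either is to c3).

```python
import math
from typing import Dict, List, Sequence, Tuple


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid_distances(
    items_by_category: Dict[str, Sequence[Sequence[float]]],
) -> Dict[Tuple[str, str], float]:
    """Cosine distance between the centroids of every pair of categories."""
    centroids = {c: centroid(vs) for c, vs in items_by_category.items()}
    names = sorted(centroids)
    return {
        (a, b): cosine_distance(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy vectors standing in for sampled item representations (not real data).
toy_items = {
    "c1": [[1.0, 0.1], [0.9, 0.2]],
    "c2": [[0.8, 0.3], [0.9, 0.4]],
    "c3": [[0.0, 1.0], [0.1, 0.9]],
}
for pair, dist in centroid_distances(toy_items).items():
    print(pair, round(dist, 3))
```

A model that captures the relation between c1 and c2 would be expected to yield a smaller (c1, c2) distance than (c1, c3) or (c2, c3), whereas a model that treats categories as isolated IDs would yield roughly equal distances.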

CONCLUSIONS
In this paper, we propose a multi-granularity-aware aspect learning model that makes better use of the additional aspect information in structured data. Unlike previous methods, which disregard the semantic relations among different aspect values, our approach incorporates multiple granularities of aspect values to facilitate query/item representation learning. By effectively capturing the semantics of queries/items from implicit views, our model achieves compelling performance even without the supervision of aspect annotations. Empirical results on two real-world datasets demonstrate the superiority of MURAL.

Figure 1: An example of a query and its candidate items.
[Figure: model input illustration. Input: Renewed Apple MacBook Air 11.6 …; aspect value embedding tables; A = {a_1: brand, a_2: category}; G = {g_1: phrase, g_2: word}; CLS; Transformer Encoder]
(2) Condenser [6]: A pre-training method tailored for unstructured textual dense retrieval. It introduces a short circuit between middle-layer tokens (excluding CLS) and their corresponding head-layer tokens during pre-training, optimizing the CLS embedding to encapsulate more information.
(3) BIBERT-CONCAT: It treats the aspect values as text and concatenates them with the query/item content during pre-training with MLM. During fine-tuning, since concatenation with the query could change the query semantics, we only concatenate item aspects for relevance matching.
(4) MTBERT [13]: A multi-task (MT) learning model based on BIBERT. Besides MLM during pre-training, it conducts |A| aspect prediction tasks using CLS.
(5) MADRAL [13]: It incorporates an aspect extraction attention network to extract |A| aspect representations for both queries and items. These embeddings are learned from aspect prediction tasks during pre-training and fused to yield the final representation during fine-tuning.
(6) MURAL, MUR and MURAL-CONCAT:

Figure 4: The t-SNE plot of item representations for MADRAL and MURAL on MA-Amazon.

Table 1: A summary of main notations used in this paper.

Table 2: Aspect-related dataset statistics. It presents the percentage of queries/items with non-empty aspect values in the pre-training corpus and the aspect value vocabulary sizes at various granularities: phrase, word, and token.

Table 3: Comparisons between MURAL and the baselines. The best results (excluding MURAL-CONCAT) are in bold. †, ‡, and * indicate significant improvements over the best baselines in the first/second group and the backbone BIBERT, respectively.
… the queries for validation and testing do not appear in the pre-training query corpus. Each instance in the relevance dataset is a <query, item, label> triplet, where the label indicates the manually annotated binary relevance of this query-item pair.

Table 4: Variants of MURAL on MA-Amazon. † indicates significant differences from the best option.

Table 5: Ablation studies of MURAL in terms of category and granularity on the MA-Amazon dataset. †, ‡ indicate significant differences over MURAL and BIBERT.
… keeping other best settings in MURAL. The harmed performance of MURAL ℎ_ in Table 4 confirms our presumption and shows the benefit of decent initialization states.
Studies on Grouping Methods. In Table 4, we observe that MURAL ℎ_ , which groups objectives by granularity, performs the best. Note that on the Alipay dataset, which has fewer aspects, MURAL ℎ_ , single-objective-based grouping, has the best performance. This is consistent with our claim in Section 4.3 that as the aspect count increases, further grouping benefits model training. Based on these observations, we suggest:

Table 6: Ablation of query and item aspects on Alipay. †, ‡ indicate significant differences over MURAL and BIBERT.

Table 7: The category aspect prediction accuracy on the Alipay dataset.