Knowledge-Enhanced Multi-Label Few-Shot Product Attribute-Value Extraction

Existing attribute-value extraction (AVE) models require large quantities of labeled data for training. However, new products with new attribute-value pairs enter the market every day in real-world e-Commerce. Thus, we formulate AVE in multi-label few-shot learning (FSL), aiming to extract unseen attribute-value pairs based on a small number of training examples. We propose a Knowledge-Enhanced Attentive Framework (KEAF) based on prototypical networks, leveraging the generated label description and category information to learn more discriminative prototypes. Besides, KEAF integrates hybrid attention to reduce noise and capture more informative semantics for each class by calculating the label-relevant and query-related weights. To achieve multi-label inference, KEAF further learns a dynamic threshold by integrating the semantic information from both the support set and the query set. Extensive experiments with ablation studies conducted on two datasets demonstrate that KEAF outperforms other SOTA models for information extraction in FSL. The code can be found at: https://github.com/gjiaying/KEAF


INTRODUCTION
Product attribute-value pairs are important for e-Commerce because platforms make product recommendations for customers based on key attribute-value pairs, and customers use attributes to compare products and make purchase decisions. Existing studies on AVE based on neural networks view AVE as sequence labeling [13,34], question answering [24,28], or multi-modal fusion [15,41]. These supervised-learning models classify attribute-value pairs well when large quantities of labeled data are available for training. Even the most recent open mining model needs a few attribute-value seeds and iterative training for weak supervision [38]. However, new products with new attribute-value pairs enter the market every day on real-world e-Commerce platforms. It is difficult, time-consuming, and costly to manually label large quantities of new product profiles for training. Besides, with the appearance of new attribute-value pairs, the class distribution becomes long-tailed: a subset of the labels have many samples, while the majority of the labels have only a few. We formalize AVE as a multi-label FSL problem, aiming to extract structured product information from unstructured profiles with limited training data. We take the common head-label data for training and the limited tail-label data for testing, with no overlap of classes between the training set and the testing set, as shown in Figure 1. Recent methods on multi-label FSL have made great progress in CV [1,25] and NLP [10,17,39]. Among these methods, the prototypical network [26] has proved powerful and promising. However, different from AVE in e-Commerce, these models (1) explore only label tags as auxiliary information, (2) still suffer from noise when learning prototypes, and (3) require further data or additional models to learn the threshold for predicting the number of labels.
To address the above challenges, we propose a Knowledge-Enhanced Attentive Framework (KEAF) for product AVE. The main contributions of KEAF are threefold. (1) To the best of our knowledge, we are the first to formulate AVE as a multi-label FSL task to tackle the problem of limited training data for long-tailed datasets. Unlike open mining models, KEAF does not require the attribute-value pairs to appear verbatim in the product profiles. (2) By leveraging both the label description generated by a generator and the category information as auxiliary information to obtain more discriminative prototypes, KEAF not only avoids the issue that different attribute-value pairs share an identical prototype in 1-shot learning, but also alleviates ambiguity by obtaining both label- and category-relevant information. The hybrid attention mechanism further reduces noise and captures more informative semantics from the support set by calculating both the label-relevant and query-related weights. (3) To achieve multi-label inference, a dynamic threshold is learned during the training stage by integrating the semantic information from the support and query sets. The adaptive threshold requires neither additional training data nor additional models. Extensive experimental results on two datasets show that our proposed model KEAF significantly outperforms other existing information extraction models for AVE.

RELATED WORKS
Early works on AVE use a domain-specific dictionary and rule-based methods to identify attribute-value pairs [8,19,23,32]. Then, sequence labeling models [13,21,34,40], question answering-based models [24,28,33], multi-modal models [15,29,41], extreme multi-label models [3,4], and open mining models [38] were trained for AVE. However, these approaches require large quantities of labeled data for training or iterative training for weak supervision. Most works on FSL focus on single-label classification [2,6,9,14,18]. However, one product may have multiple attribute-value pairs in the AVE task. Early works on multi-label FSL depend on a known structure of the label space [22] and label set operations [1]. Then, prototypical networks [27] were revised for the multi-label case by learning a shared embedding space [37], grouping samples multiple times [25], and learning local features with different labels [35]. Attention mechanisms [11] and label information [10,17,39] have also been considered to differentiate prototypes. Different from these approaches, we leverage both label and category information for product AVE in e-Commerce.

Problem Formulation
Each training episode involves a support set S and a query set Q, where S usually includes K samples (K-shot) for each of N labels (N-way). In contrast to the single-label N-way-K-shot setting [30], multi-label FSL allows each sample to have multiple labels simultaneously. There are N total classes, and each class has at least K samples (with at least one label appearing fewer than K times if any sample is removed), because we cannot guarantee that each label appears exactly K times when each sample has multiple labels. The input data for each product is a tuple <t, d, l, c>, where t is the product title, d is the product description, l is the label description, and c is the product category. The input label is a vector y = [y_1, y_2, ..., y_N], where y_i ∈ {0, 1} indicates whether the product has the i-th label and N is the total number of classes. The outputs are attribute-value pairs.

Multi-label Few-Shot Data Sampling
Multi-label few-shot data sampling includes data splitting, data balancing, and data sampling. We first reconstruct the dataset by splitting the data based on an upper threshold and a lower threshold learned from the frequency of class labels, guaranteeing that C_train ∩ C_test = ∅. We filter the dataset by discarding samples whose label count is below the lower threshold or above the upper threshold, updating the label dictionary and its corresponding samples. To guarantee a total shot of K + Q >= 10 for FSL, the filtering process is repeated iteratively until the class number N is fixed. Then, we balance the dataset by randomly dropping single-label data until its size is similar to that of the multi-label data. To approximately conduct N-way-K-shot learning, we follow [10] to construct query and support sets for each episode. Details of multi-label few-shot data sampling are shown in Algorithm 1.
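As a rough illustration of the episode construction described above (not the authors' Algorithm 1; function and variable names are ours), the following sketch draws an approximate N-way-K-shot episode from multi-label data, where a sample picked for one class may also carry other class labels:

```python
import random

def sample_episode(label_to_samples, n_way, k_shot, q_shot, seed=0):
    """Approximate N-way-K-shot episode sampling for multi-label data.

    label_to_samples: dict mapping each class label to the list of
    sample ids carrying that label (one id may appear under several
    labels, since a product can have multiple attribute-value pairs).
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(label_to_samples), n_way)
    support, query = [], []
    seen = set()
    for c in classes:
        # only consider samples not already drawn for an earlier class
        pool = [s for s in label_to_samples[c] if s not in seen]
        picked = rng.sample(pool, min(k_shot + q_shot, len(pool)))
        seen.update(picked)
        support.extend(picked[:k_shot])
        query.extend(picked[k_shot:k_shot + q_shot])
    return classes, support, query
```

Because the data is multi-label, each class is only guaranteed K support samples approximately; the balancing step described above reduces the single-label skew before sampling.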

Knowledge-Enhanced Attentive Framework
In this section, we introduce the overview of KEAF in Figure 2.

Contextual Representations.
Labels for AVE tasks are attribute-value pairs such as 'wallet type: long wallet', which may lose contextual information due to their simple format. To obtain more information related to the labels, we adopt GPT-2 [20] as the text generator to generate a detailed description for each attribute-value pair. We adopt the pre-trained language model BERT [5] as the product input encoder to generate the contextual representation. We construct the string [CLS; c; SEP; t; SEP; d] by concatenating the product category, title, and description as the input. The output representations for the product input x_p and the label input x_l are:

x_p = f_θ([CLS; c; SEP; t; SEP; d]),  x_l = f_θ(g_φ(l)),    (1)

where f_θ is the BERT encoder, g_φ is the GPT-2 generator, c is the category, t is the title, d is the description, and l is the 'attribute is value' label information.
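A minimal sketch of the input construction in Eq. 1 (the separator handling is our simplification; a real BERT tokenizer inserts the [CLS]/[SEP] tokens itself, and the helper names are ours):

```python
def build_product_input(category, title, description):
    # concatenate category, title and description with BERT-style
    # special tokens, mirroring the [CLS; c; SEP; t; SEP; d] string
    return f"[CLS] {category} [SEP] {title} [SEP] {description}"

def build_label_prompt(attribute, value):
    # the 'attribute is value' phrase that is fed to the GPT-2
    # generator to produce a richer label description
    return f"{attribute} is {value}"
```

In practice both strings would then be tokenized and passed through the respective pre-trained models to obtain x_p and x_l.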

Label-Enhanced Prototypical Network.
In Figure 2, we adopt prototypical networks [26] to get the original prototype of each attribute-value pair by averaging the embeddings of its support samples. However, different labels may share the same support samples in the multi-label setting, resulting in severe ambiguity. To emphasize the difference between prototypes and reduce such ambiguity, we leverage label descriptions generated by GPT-2 [20] to fully express the semantic information of attribute-value pairs and help learn more representative prototypes. Label information has been shown to have a significant effect on learning more discriminative prototypes [10,17,39]. Thus, we combine the label with the average of the support samples to compute a label-enhanced prototype c_i with an interpolation factor α:

c_i = α f_θ(l_i) + (1 − α) (1/K) Σ_{x_j ∈ S_i} f_θ(x_j),    (2)

where f_θ(·) is the BERT encoder, l_i is the label description, S_i = {x | (x, y) ∈ S ∧ y_i ∈ y} is the set of support samples labeled with y_i, and K is the shot number. The combination of label description and support embeddings helps keep the prototypes better separated from each other.
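The interpolation in Eq. 2 can be sketched with plain Python lists standing in for embedding vectors (the helper names are ours):

```python
def mean_embedding(vectors):
    # element-wise mean of the K support embeddings
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def label_enhanced_prototype(label_emb, support_embs, alpha=0.5):
    # c_i = alpha * f(l_i) + (1 - alpha) * mean of support embeddings
    mean = mean_embedding(support_embs)
    return [alpha * l + (1 - alpha) * m for l, m in zip(label_emb, mean)]
```

With alpha = 1.0 the prototype collapses to the label description alone, which is what keeps two classes apart even when they share the same single support sample in the 1-shot case.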

Hybrid Attention.
The aim of hybrid attention is to select more informative instances by retaining attribute-value-relevant information while eliminating the negative effect of noise. As shown in the third stage of Figure 2, we first capture the similarity weight in the label by calculating the semantic similarity between the label-enhanced prototype embedding c_i from Eq. 2 and the attribute-value description embedding x_l from Eq. 1:

ĉ_i = sim(c_i, x_l) · c_i,    (3)

where sim(·) is the cosine similarity and ĉ_i carries the class-relevant information. To further capture informative semantics from query-related instances and reduce the noise, we apply instance-level attention, where each support instance x_j has a different importance factor β_j:

r_i = sim(c_i, x_l) Σ_{x_j ∈ S_i} β_j f_θ(x_j),  β_j = softmax_j(W(f_θ(x_j) ⊙ f_θ(x_q))),    (4)

where W(·) is the linear layer, f_θ(·) is the encoder from Eq. 1, x_q represents the query instance, and r is the final prototype. Now r contains label-relevant semantic information and can be closer to the instances whose features are more related to the queries.

Dynamic Threshold.
In Figure 2, we train the threshold τ by integrating the semantic information from both the support and query sets. The thresholding function is calculated as the element-wise product of the query label counter with the relevance score between the final prototype r in Eq. 4 and the query instance x_q generated from Eq. 1. The number of query labels is estimated by averaging the numbers of support labels of the support instances x_s:

τ = ((1/(N·K)) Σ_{(x_s, y_s) ∈ S} count(y_s)) ⊙ d(r, f_θ(x_q)),    (5)

where S is the support set, N and K denote N-way-K-shot, count(·) represents the label counter, ⊙ is the element-wise product, and d(·) is the distance function. The threshold is dynamically updated at each training epoch. In testing, the framework predicts the query label set Y_q by comparing each distance d_i with the threshold τ. The final threshold for the testing phase is the value that achieves the best performance in the evaluation phase. The model is trained by repeatedly sampling training episodes from C_train with support set S and query set Q. The model parameters are updated using the following binary cross-entropy (BCE) loss:

L = −(1/(Q·N)) Σ_q Σ_{i=1}^{N} [y_i log σ(d_i) + (1 − y_i) log(1 − σ(d_i))],    (6)

where Q is the query shot, N is the N-way, σ(·) is the sigmoid function, and y_i represents the ground truth.
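A loose sketch of the threshold-based multi-label inference, under our assumption that a class is predicted when its relevance score clears the threshold (names are ours; the paper's exact thresholding function combines the estimated label count with a learned relevance score):

```python
def estimate_label_count(support_label_vectors):
    # average number of positive labels per support instance,
    # used to estimate how many labels a query should receive
    n = len(support_label_vectors)
    return sum(sum(y) for y in support_label_vectors) / n

def predict_labels(relevance_scores, tau):
    # multi-label inference: keep every class index whose
    # relevance score exceeds the (dynamic) threshold tau
    return [i for i, s in enumerate(relevance_scores) if s > tau]
```

Unlike a fixed top-k cutoff, this lets different queries receive different numbers of labels, which is what the dynamic threshold is for.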

EXPERIMENTS

Experimental Setup
We evaluate our model on two datasets: a large e-Commerce platform in Japan and MAVE [36]. The dataset statistics are shown in Table 1. We compare KEAF with SOTA few-shot IE approaches: models without label information, Siamese [14], Proto [26], and MTB [2], and models with label information, HCRP [9], FAEA [6], and SimpleFSRE [18]. For evaluation, we use both Micro (Mic-) and Macro (Mac-) Precision, Recall, and F1. The max input length is 512 for the product and 32 for the label anchor. The dimension size is 768. We vary the label interpolation factor α in {0.1, 0.5, 1.0} and select the optimal anchor weight. Our model is implemented in PyTorch and optimized with the AdamW optimizer. The learning rate is 10^-5 with weight decay 10^-6. The batch size is 1 and the dropout rate is 0.2. The experiments are conducted on an Nvidia A100 GPU with 80GB of GPU memory.

Main Results.
The results of multi-label FSL are shown in Table 2 and Table 4. We observe: (1) KEAF significantly outperforms the other baselines on both macro and micro F1 in the 1-shot and 5-shot settings. These results reveal that KEAF better learns the prototypes and captures informative semantics. (2) On the in-house e-Commerce dataset, models using label semantics improve performance more in the 1-shot than in the 5-shot setting. This is consistent with our expectation that adding label information helps reduce ambiguity. On MAVE, baselines using labels even perform worse. We conjecture that the original labels in MAVE are too simple to learn from, and that they even introduce noise. In KEAF, we generate a more detailed label description to better integrate the label and reduce the noise, resulting in the best performance among all models. (3) MTB and other baselines achieve good results only on Recall and poor results on Precision. For AVE, low precision means substantial human effort is needed to manually remove the extracted non-relevant attribute-value pairs. A possible reason is that these baselines learn a very large threshold, trying to predict as many labels as possible and thus yielding a very large recall value. In contrast, KEAF better learns the threshold and balances Precision and Recall, resulting in the highest F1 score.

Ablation Study.
To verify the effectiveness of each component in KEAF, we conduct a 1-shot ablation study in Table 3. We observe: (1) Fusing the label anchor into the prototypes yields a large performance improvement, because the label helps discriminate prototypes and reduce ambiguity. (2) Generating a more detailed label description improves performance more on MAVE than on the in-house e-Commerce dataset. We conjecture that, since MAVE is an English dataset and English GPT-2 is better trained than Japanese GPT-2, the generated label descriptions on MAVE are more accurate. (3) Category information is vitally important on MAVE. We think that the attribute-value pairs come from different categories, so adding the category better separates the prototypes. (4) Using the attention improves performance by reducing the noise to some extent.

CONCLUSION
In this paper, we formulate the attribute-value extraction task in few-shot learning to address the long-tailed data problem and the limited training data for new products. We propose a Knowledge-Enhanced Attentive Framework for product AVE. We design a label-enhanced prototypical network with hybrid attention to alleviate ambiguity and noise and to capture more informative semantics. We train a dynamic threshold to achieve multi-label inference. Results demonstrate that KEAF significantly outperforms other SOTA IE models.

Figure 1 :
Figure 1: An example of multi-label few-shot product attribute-value extraction task.
Given a set of training classes C_train and testing classes C_test, where C_train ∩ C_test = ∅, the model is trained with numerous samples from C_train and can quickly adapt to C_test with few labeled data. Each training episode involves a support set S = {(x_i, y_i)}_{i=1}^{N×K} and a query set Q = {(x_j, y_j)}_{j=1}^{N×Q}.

Figure 2 :
Figure 2: The overview of our proposed KEAF framework.


Table 1 :
Comparison with other multi-label FSL datasets.