Hierarchical Query Classification in E-commerce Search

E-commerce platforms typically store and structure product information and search data in a hierarchy. Efficiently categorizing user search queries into a similar hierarchical structure is paramount to enhancing user experience on e-commerce platforms, as well as in news curation and academic research. The significance of this task is amplified when dealing with sensitive query categorization or critical information dissemination, where inaccuracies can have considerable negative impacts. The inherent complexity of hierarchical query classification is compounded by two primary challenges: (1) the pronounced class imbalance that skews towards dominant categories, and (2) the inherent brevity and ambiguity of search queries that hinder accurate classification. To address these challenges, we introduce a novel framework that leverages hierarchical information through (i) enhanced representation learning that utilizes a contrastive loss to discern fine-grained instance relationships within the hierarchy, called "instance hierarchy", and (ii) a nuanced hierarchical classification loss that attends to the intrinsic label taxonomy, named "label hierarchy". Additionally, based on our observation that certain unlabeled queries share typographical similarities with labeled queries, we propose a neighborhood-aware sampling technique to intelligently select these unlabeled queries to boost the classification performance. Extensive experiments demonstrate that our proposed method outperforms the state of the art (SOTA) on the proprietary Amazon dataset, and is comparable to SOTA on the public Web of Science and RCV1-V2 datasets. These results underscore the efficacy of our proposed solution, and pave the path toward the next generation of hierarchy-aware query classification systems.


INTRODUCTION
Hierarchical query classification is a vital task in the domain of e-commerce and search, playing a crucial role in driving customer obsession [8]. As users interact with online services, they input various queries to search for products, services, or information. Accurately classifying these queries is pivotal in ensuring that users are presented with the most relevant and valuable results. One significant application of the hierarchical query classifier in industry is categorizing sensitive queries that follow a predefined hierarchy in e-commerce. For example, given a query, it can be classified as harmful, adult-oriented, or non-sensitive products (here, for illustration, we define these categories as parent categories). Furthermore, the harmful category has two child categories: self-harm and harm to others. The child categories for the adult-oriented category can be adult products and adult content. Since these queries contain offensive content or pertain to unregulated goods, and different categories need to be handled differently, misclassification of such queries can lead to unpleasant or even detrimental user experiences, potentially damaging a service's reputation and user trust. Moreover, presenting inappropriate or restricted content could lead to legal ramifications for the service. Hence, building an accurate hierarchical query classification framework is of paramount importance, not just for user satisfaction, but also for the overall compliance and integrity of the service [31].
Various machine learning techniques are employed to identify the appropriate hierarchical category for each query [8,41,42], and to sense the context and nuances each query presents [38]. However, these algorithms usually require large-scale high-quality annotated data, which is challenging to obtain. Instead, the more practical semi-supervised setting has gained popularity [3,17], where unlabeled queries are used to boost the classification performance. When executed effectively, a well-performing hierarchical query classifier enhances the user experience, fostering a smoother and more productive interaction between users and the service.
However, building an accurate hierarchical query classification framework in the real world is non-trivial due to two challenges: (1) severe class imbalance. Take Amazon as an example: sensitive queries are infrequent, accounting for only 0.05%~0.15% of all queries. Even worse, when training the classification model, only a small initial set of sensitive queries is accessible. This greatly hinders the development of a high-quality classifier. (2) short and ambiguous query text. The average search query is about three words [11], leading to a weaker semantic understanding of queries for correct classification.
To overcome these two challenges, we propose a semi-supervised machine learning framework utilizing the instance hierarchy and label hierarchy to enhance query representation learning and classification performance. Particularly, (1) for the class imbalance challenge, we use contrastive learning to learn representations that attend to the minority classes through instance hierarchy. Intuitively, even if the number of queries under a child category can be small, the number of queries under the corresponding parent category is large, and these queries are close to each other. To leverage them, we adopt contrastive learning where we create positive pairs from queries under the same child category and negative pairs from queries in different child categories under the same parent. We formulate this as the intra-class hierarchy in instance hierarchy, and further extend it to the inter-class hierarchy, where we consider positives as queries across child categories and negatives as queries across parent categories, as shown in Figure 1. This instance hierarchy helps capture fine-grained information for model training.
(2) for the ambiguous and short text challenge, we utilize information from three sources to enhance the understanding of a query, i.e., (i) the neighboring queries under the same child category. Motivated by the fact that not all queries are short and ambiguous, and in particular that some neighboring queries are clear and distinguishable, we can leverage the similarity between neighboring queries via the aforementioned intra-class hierarchical contrastive learning; (ii) the neighboring child categories that share the same parent category. The intuition is that when implicitly aligning the query to different child categories under the same parent category, the model learns the semantics of the query from other queries. This is achieved by adding a hierarchy-aware loss to our classification task; (iii) the text of the label itself, which usually contains useful signals such as semantic meaning that contribute to the downstream classification. Based on this, we first adopt BERT [9] to encode the label text into an embedding vector to capture its contextualized semantics. We concurrently create a "label" graph to take the hierarchies/relationships between labels into account and employ the previously generated embedding vectors as node features for downstream graph representation learning. We finally combine the resulting representation vector with the query embedding vector to form the finalized feature vector of a query for the final classification task. Since we use hierarchical label information in this step, we define this process as label hierarchy. Altogether, the designed instance hierarchy and label hierarchy components aim to address the aforementioned challenges for better classification.
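To make the instance-hierarchy idea in (1) concrete, the sketch below constructs intra- and inter-class pairs from (query, child category, parent category) annotations. The toy data and function names are our own illustration, not the paper's implementation.

```python
# Toy labeled queries: (query, child_category, parent_category).
LABELED = [
    ("sleeping pills overdose", "self-harm", "harmful"),
    ("how to hurt myself", "self-harm", "harmful"),
    ("buy illegal weapon", "harm-others", "harmful"),
    ("adult toys", "adult-products", "adult"),
    ("adult movies", "adult-content", "adult"),
    ("running shoes", "footwear", "non-sensitive"),
]

def intra_class_pairs(data):
    """Positives: same child category; negatives: same parent, different child."""
    pos, neg = [], []
    for i, (q1, c1, p1) in enumerate(data):
        for q2, c2, p2 in data[i + 1:]:
            if c1 == c2:
                pos.append((q1, q2))
            elif p1 == p2:
                neg.append((q1, q2))
    return pos, neg

def inter_class_pairs(data):
    """Positives: same parent, different child; negatives: different parents."""
    pos, neg = [], []
    for i, (q1, c1, p1) in enumerate(data):
        for q2, c2, p2 in data[i + 1:]:
            if p1 == p2 and c1 != c2:
                pos.append((q1, q2))
            elif p1 != p2:
                neg.append((q1, q2))
    return pos, neg

intra_pos, intra_neg = intra_class_pairs(LABELED)
inter_pos, inter_neg = inter_class_pairs(LABELED)
```

Note how the intra-class negatives coincide with the inter-class positives: pairs that are "dissimilar" at the fine-grained level become "similar" at the coarser level, which is exactly the hierarchical relationship the two contrastive levels exploit.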
In real-world e-commerce search, we observe that there exist many unlabeled queries that share typographical similarity with annotated queries of the same category, due to potential typos. These queries, when used as training examples, can assist the classifier by improving robustness against mis-typed queries, as they are typographically close to the labeled queries and augment the dataset. Based on this finding, we deploy a self-training stage in our pipeline, i.e., we use the pseudo labels of classified queries to retrain the model. When selecting queries for self-training, inspired by the aforementioned observation, we develop neighborhood-aware sampling to effectively identify high-quality similar queries. We argue that the observation about typographical similarity can be extended to generic semantic similarity, and the proposed self-training pipeline can be adapted accordingly. Besides, we can interpret this observation from the adversarial learning perspective, where we adversarially generate similar queries or identify similar queries from our unlabeled pool for model training to improve the robustness of the classifier. Overall, we deploy self-training in our framework to utilize the crucial unannotated queries for classification performance gains.
To evaluate the proposed method, we examine it on proprietary Amazon data and the public Web of Science and RCV1-V2 datasets using Micro-F1 and Macro-F1 scores. Our proposed method achieves the best performance in most cases across all compared methods and datasets, except for Micro-F1 on the Web of Science and RCV1-V2 datasets. However, Micro-F1 is less critical than Macro-F1 in real-world applications, since we have an imbalanced class distribution and need to focus on minority classes, which Macro-F1 attends to. Our results demonstrate the efficacy of our proposed method, especially on the Amazon dataset. Our method is generalizable to hierarchical query classification tasks in other domains and paves the path toward the next generation of hierarchy-aware query classification. The main contributions of our work are:
• We propose a new algorithm that utilizes the instance and label hierarchy through contrastive learning-enhanced representation learning, which allows us to leverage hierarchical information in a fine-grained manner to improve classification performance.

RELATED WORKS
In this section, we briefly introduce relevant research areas.

Hierarchical Query Classification
Hierarchical query classification aims at classifying queries into a category within a given taxonomy to understand user intent and facilitate downstream recommendation tasks. It can be formulated as a text classification problem where the input text is a combination of short keywords [41]. Existing conventional methods employ either a single flattened multi-class classifier or multiple binary classifiers. Based on the extracted query features, these works can be categorized into two groups: (1) N-gram-based features [6]: since the query keywords are indicative of the category a query belongs to, keyword counts can serve as features; (2) Embedding-based features [9,29]: due to advances in deep learning and natural language processing, some researchers use word embeddings (e.g., GloVe) and contextualized embeddings (e.g., BERT) to represent the query for classification. Later, researchers designed advanced classification models and learning paradigms utilizing additional information from queries to enhance classifiers. For instance, Liu et al. [23] proposed a mixture of a convolutional neural network and Naive Bayes as a classifier, while Wang et al. [36] incorporated the hierarchy of label information into the text encoder via a graph encoder. Besides, context-aware session information [5] and searcher engagement data [14] have been explored as well. Different from these existing efforts, which rely on extra information or overlook abundant unlabeled data, we aim to boost performance using only easily accessible query and label data combined with unlabeled queries.

Imbalanced Learning
Class imbalance is a common issue in text classification, and is especially prominent in the hierarchical setting [7]. To address it, one commonly-used solution is re-sampling, which involves oversampling the minority class, under-sampling the majority class, or combining both to achieve a balanced class distribution [24]. Another strategy is cost-sensitive learning [33], where higher costs are assigned to the misclassification of minority classes during model training, eventually making the model more sensitive to the minority class. The Synthetic Minority Over-sampling Technique is another notable approach that generates synthetic instances of the minority class to balance the class distribution [7]. For instance, Pereira et al. [30] utilized path and depth information to oversample and undersample data points to improve the classifier. Different from the previous works, we utilize the unlabeled queries that are predicted as minority classes to augment datasets.
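As a concrete reference point for the re-sampling strategy mentioned above, the sketch below performs naive random oversampling of minority classes until the class distribution is balanced. It is a generic illustration with our own helper name, not the scheme of Pereira et al. [30].

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())  # majority-class size
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw (target - len(xs)) duplicates at random from this class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

X = ["q1", "q2", "q3", "q4", "q5", "q6"]
y = ["majority"] * 5 + ["minority"]
Xb, yb = oversample(X, y)
```

After oversampling, `Counter(yb)` reports five examples per class; duplicated minority examples carry no new information, which is why synthetic generation (SMOTE) and cost-sensitive losses are common refinements.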

Contrastive Learning
Contrastive learning has emerged as a powerful paradigm in unsupervised and self-supervised learning [15], significantly reducing the performance gap between supervised and unsupervised learning. At its core, contrastive learning aims to learn similar representations for semantically similar instances and dissimilar representations for distinct ones. It accomplishes this by distinguishing the "positive" pairs (two similar data points) from the "negative" pairs (two dissimilar data points). Especially by leveraging large amounts of unlabeled data, it opens up new avenues for model training in scenarios where labeled data is scarce or expensive to obtain. The effectiveness of this approach has been showcased in numerous applications, such as image recognition, speech recognition, and natural language processing [21].

Semi-supervised Text Classification
Semi-supervised learning is a promising research direction since it utilizes both the labeled and unlabeled data points in machine learning [34], which alleviates the high cost of data annotation.

PROBLEM DEFINITION
In this section, we provide the mathematical definition of the hierarchical query classification problem. Note that, for simplicity of presentation, we assume the problem space is a two-level category hierarchy, but the proposed method is extensible to accommodate a multi-level category hierarchy.
We have a set of queries Q = {q_1, q_2, ..., q_N}, where q_i is the i-th query.

PROPOSED METHOD
In this section, we provide the details of the proposed semi-supervised hierarchical query classification framework to accurately classify a query for a given taxonomy. Specifically, the framework has three major components: i) It utilizes the hierarchical label information to enhance initial query embeddings. ii) It attends to fine-grained instance hierarchy by modeling intra-class and inter-class relationships. The resultant contrastive loss boosts the query embedding learning, and is finally combined with a classification loss to train the classifier. iii) Through the proposed neighborhood-aware sampling technique, it selectively chooses high-quality unlabeled data points with pseudo labels to augment existing labeled data for model re-training. An overview of our proposed method is presented in Figure 1.

Label Hierarchy
Given a query q_i, we pass it to BERT to get the textual embedding h_i^q, following existing works on feature vector generation [9,26]. To attend to the hierarchy, we first create a label graph G = (V, E) representing the taxonomic hierarchy, where V is the vertex set of labels and E is the edge set of connections between parent and child labels. Because the label text (e.g., "self-harm", "adult products") contains useful semantic information for the downstream classification, we follow similar approaches that transfer text into embedding vectors [13,28], passing the label text to BERT to get its textual embedding as the node feature vector. For the root node, we use the average of all label embedding vectors. Due to the advances in graph neural networks for graph representation [39], in practice, we pass the graph to a conventional two-layer graph convolutional network [19] to get the label embeddings H^l = GCN(G). Note that not all label embeddings are equally relevant for a given query. Motivated by the attention mechanism [35], we compute the attention score between a query and each label embedding. Formally, we have an attention matrix A = softmax(h_i^q W (H^l)^T), where W is a matrix for feature dimension alignment between h_i^q and H^l during matrix multiplication. Then, we compute the attention-weighted label features, denoted as h_i^l = A H^l. We finally concatenate them with the query textual embedding h_i^q derived by BERT to form the final representation z_i = [h_i^q, h_i^l] for the downstream classification task. Particularly, for query q_i, we predict its label as ŷ_i = softmax(MLP(z_i)), trained with a cross-entropy loss L_CE. In addition, to leverage the hierarchical label information, we use the neighboring child label information to assist the classification by aligning the predicted child category to the neighboring child categories. Such a signal is incorporated into the model by a loss L_neighbor derived between the current child category c_i and the neighboring child categories under p_i, where p_i denotes the parent category of c_i. By combining the two losses together, we have the final classification loss L_cls = L_CE + α · L_neighbor, where y_i is the label of the query and α adjusts the relative importance of the two losses. In implementation, we empirically set α = 1.
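The attention-weighted fusion step above can be sketched as follows, with random vectors standing in for the BERT query embedding and GCN label embeddings; the shapes and helper name are our own assumptions.

```python
import numpy as np

def fuse_query_with_labels(h_q, H_l, W):
    """Attention-weighted label features concatenated to the query embedding.

    h_q: (d_q,) query embedding; H_l: (n_labels, d_l) label embeddings;
    W:   (d_q, d_l) alignment matrix between the two feature spaces.
    """
    scores = (h_q @ W) @ H_l.T                    # (n_labels,) raw attention scores
    scores = scores - scores.max()                # shift for numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over labels
    h_l = attn @ H_l                              # (d_l,) weighted label feature
    return np.concatenate([h_q, h_l])             # final representation z_i

rng = np.random.default_rng(0)
d_q, d_l, n_labels = 8, 4, 5
h_q = rng.normal(size=d_q)
H_l = rng.normal(size=(n_labels, d_l))
W = rng.normal(size=(d_q, d_l))
z = fuse_query_with_labels(h_q, H_l, W)
```

The concatenated vector `z` (here of dimension d_q + d_l = 12) is what the downstream MLP classifier consumes.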

Instance Hierarchy
Given query q_i, to learn its comprehensive representation in the context of its hierarchical structure, we use contrastive learning at two levels: (1) the intra-class hierarchy; and (2) the inter-class hierarchy. Particularly, for the intra-class level, we randomly sample two queries within the same child category as one positive pair (q_i, q_j), q_i ∈ c_k, q_j ∈ c_k. Query pairs that are in the same parent category but from different child categories are treated as negative pairs (q_i, q_n), q_i ∈ c_k, q_n ∉ c_k, q_i ∈ p_m, q_n ∈ p_m. The contrastive objective is defined as L_intra = -log( exp(sim(z_i, z_j)/τ) / Σ_n exp(sim(z_i, z_n)/τ) ), where sim(u, v) = u·v / (||u|| ||v||) denotes the cosine similarity between two vectors and τ is a temperature parameter. Similarly, for the inter-class level, we take the predefined (q_i, q_n) query pair as the positive, and query pairs that are in different parent categories as negatives (q_i, q_r), q_i ∈ p_m, q_r ∉ p_m. The corresponding contrastive loss L_inter takes the same form.
Combining them together, the contrastive loss is given as L_contrastive = λ_intra-class · L_intra + (1 - λ_intra-class) · L_inter, where λ_intra-class denotes the weight of the intra-class hierarchy in the contrastive loss.
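A minimal numeric sketch of one InfoNCE-style term of this loss is shown below; the toy embeddings, temperature value, and function name are our own assumptions rather than the paper's exact formulation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss for one anchor: pull the positive close,
    push the negatives away, using cosine similarity."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    logits = np.array(sims) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # -log softmax of the positive

# Toy embeddings: the positive is aligned with the anchor, negatives are not.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negatives = [np.array([-1.0, 0.0]), np.array([0.0, 1.0])]
loss_aligned = info_nce(anchor, positive, negatives)
loss_misaligned = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

The loss is small when the positive is closest to the anchor and large otherwise; the intra- and inter-class terms each apply this with their own pair definitions, and are mixed by λ_intra-class.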

Objective Function and Model Training
Finally, we combine the classification and contrastive losses together, which arrives at the overall objective L = L_cls + λ_contrastive · L_contrastive, where λ_contrastive indicates the weight of the contrastive loss in the final loss computation.
When training the model, we minimize this loss through back-propagation using the Adam optimizer [18].

Neighborhood-aware Sampling
After training, we apply the classifier F to predict labels for the unlabeled data Q_U and then use the classified high-confidence data points to retrain the classifier. Motivated by our aforementioned observation that queries with similar labels tend to share similar typographical representations, we develop a neighborhood-aware sampling algorithm containing the following steps. Given an unlabeled query q_u ∈ Q_U, after inference by F(q_u), we have the predicted child category ĉ_u and parent category p̂_u. Step I: We use K-Nearest Neighbors (KNN) to find the labeled queries similar to the unlabeled query in the feature space. Here, motivated by the aforementioned observation, the feature space can be the simple string space, where we perform the neighborhood search with the Levenshtein (edit) distance [4]. More generally, we use the previously generated BERT embedding to represent the feature space, due to its powerful semantic representation in the broader case. Formally, we have N(q_u) = KNN(q_u, Q_L, K), where each neighbor q_l is drawn from the labeled query set Q_L with child category c_l and parent category p_l. In practice, we utilize the hierarchical navigable small world (HNSW) method for the indexing and search process because of its efficiency in high-dimensional data spaces, making it a suitable choice for large-scale and high-dimensional datasets [27]. To measure the similarity between queries, we adopt the cosine similarity metric. By using these two off-the-shelf solutions, we aim to achieve an efficient and accurate KNN search process.
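For the string-space variant of Step I, a pure-Python sketch is shown below; the Levenshtein implementation, toy data, and K value are illustrative, and a production system would use an off-the-shelf edit-distance library or HNSW over embeddings as described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def knn_edit_distance(query: str, labeled: list, k: int = 2):
    """Return the k labeled (query, child, parent) tuples closest to `query`."""
    return sorted(labeled, key=lambda t: levenshtein(query, t[0]))[:k]

LABELED = [
    ("adult toys", "adult-products", "adult"),
    ("running shoes", "footwear", "non-sensitive"),
    ("adult movies", "adult-content", "adult"),
]
# A typo of "adult toys" still retrieves the right labeled neighbor.
neighbors = knn_edit_distance("adlut toys", LABELED)
```

This is exactly the typographical-similarity observation at work: the mis-typed query lands nearest to its correctly spelled labeled counterpart.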
Step II: After getting the neighboring labeled queries, we need to compute their distribution for the downstream sampling. To achieve this goal, we leverage the child category information shared between labeled and unlabeled queries. The intuition is that if the unlabeled query shares the same child category as a labeled neighbor, chances are high that the predicted child category is correct. Specifically, we compute KL divergence scores between the predicted child category of the unlabeled query and the child categories of its labeled neighbors, and add the scores from the child category information together to obtain the child-level sampling distribution s_c.
Step III: Similarly, we compute the distribution s_p based on the parent category information of the neighbors. After addition, we have the final sampling distribution s ∝ s_c + s_p.
Step IV: We sample the unlabeled queries following s. We then add the sampled data points {(q_u, ĉ_u, p̂_u)} to our existing labeled queries Q_L to retrain the classifier. After re-training, we run the neighborhood-aware sampling again to select new queries to augment the existing labeled dataset. We repeat these steps until the model converges.
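Steps II-IV can be sketched as an agreement-weighted sampler; the scoring below is a simplified stand-in for the KL-divergence-based distribution, and all names and toy data are our own.

```python
import random

def neighbor_agreement_scores(unlabeled_preds, neighbors_of):
    """Score each unlabeled query by how many of its labeled neighbors
    agree with its predicted child and parent categories (Steps II-III)."""
    scores = {}
    for q, (child, parent) in unlabeled_preds.items():
        nbrs = neighbors_of[q]
        child_agree = sum(1 for _, c, _ in nbrs if c == child)
        parent_agree = sum(1 for _, _, p in nbrs if p == parent)
        scores[q] = child_agree + parent_agree
    return scores

def sample_for_self_training(scores, n, seed=0):
    """Sample n queries without replacement, proportionally to score (Step IV)."""
    rng = random.Random(seed)
    pool = {q: s for q, s in scores.items() if s > 0}
    picked = []
    for _ in range(min(n, len(pool))):
        qs, ws = zip(*pool.items())
        q = rng.choices(qs, weights=ws, k=1)[0]
        picked.append(q)
        del pool[q]
    return picked

preds = {"adlut toys": ("adult-products", "adult"),
         "runing shoes": ("footwear", "non-sensitive"),
         "xyzzy": ("self-harm", "harmful")}
nbrs = {"adlut toys": [("adult toys", "adult-products", "adult")],
        "runing shoes": [("running shoes", "footwear", "non-sensitive")],
        "xyzzy": [("running shoes", "footwear", "non-sensitive")]}
scores = neighbor_agreement_scores(preds, nbrs)
chosen = sample_for_self_training(scores, n=2)
```

Queries whose neighbors disagree with their pseudo label (like "xyzzy" here) score zero and are never sampled, which is the mechanism that keeps low-quality pseudo labels out of the retraining set.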

EXPERIMENTAL EVALUATION
In this section, we examine the performance of the proposed framework by conducting extensive experiments. Specifically, we aim to answer the following research questions:
• RQ1: How effective is our proposed method when compared to other methods?
• RQ2: What is the contribution of each component in the proposed framework?
• RQ3: How sensitive is the model performance when we change the parameters?

Datasets
We evaluate the proposed framework on both public and proprietary datasets.
5.1.1 Public Datasets. We adopt two benchmark datasets widely used for hierarchical text classification: Web of Science and RCV1-V2. The data statistics are shown in Table 1.
• Web-of-Science (WoS) [20]: The WoS dataset contains keywords and abstracts of academic papers across several disciplines, e.g., economics and science. Each paper also has a hierarchical domain-area label, representing the hierarchical nature of the discipline. This makes WoS suitable for the hierarchical query classification task, where we treat keywords, area, and domain as query, child category, and parent category, respectively.
• RCV1-V2 [22]: The RCV1-V2 dataset is a benchmark corpus for text categorization research, where each document has metadata such as date and title in addition to the content of the news story. It comprises an archive of over 800,000 manually categorized newswire stories from Reuters Ltd. Its hierarchical categorization scheme includes four main topics (Corporate/Industrial, Economics, Government/Social, and Markets), which are further divided into subtopics, leading to over 100 leaf-level categories. Thus, it is an excellent dataset for hierarchical classification tasks. We use extracted nouns from titles, subtopic, and main topic as query, child category, and parent category, respectively.

Proprietary Dataset.
For the proprietary dataset, we use Amazon search queries as our testbed to examine real-world application settings. We sampled 9~10 million user queries to create the dataset. In the Amazon dataset, a substantial portion consists of unlabeled queries, making up 40% to 50% of the total. Labeled non-sensitive queries also represent a significant segment, comprising 45% to 58% of the data. Within the labeled sensitive categories, queries related to adult-oriented products form 3% to 6%, while adult content is less common, constituting only 0.3% to 0.5%. The dataset also includes a small fraction of potentially harmful queries, with those related to self-harm and harm to others present in 0.003% to 0.005% and 0.01% to 0.03%, respectively. The remaining sensitive queries account for 0.04% to 0.07%.

Evaluation Metrics.
Since this is a standard imbalanced data classification task, we follow the measurements used in existing related works: the Micro-F1 and Macro-F1 scores [36,37].
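For intuition on why Macro-F1 is the more telling of the two metrics under class imbalance, the pure-Python sketch below computes both on a toy imbalanced prediction set (the helper and example are ours).

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_total = fp_total = fn_total = 0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)  # pooled counts
    macro = sum(per_class) / len(per_class)                      # per-class mean
    return micro, macro

# 9 majority examples classified correctly, 1 minority example missed.
y_true = ["non-sensitive"] * 9 + ["self-harm"]
y_pred = ["non-sensitive"] * 10
micro, macro = f1_scores(y_true, y_pred)
```

Here Micro-F1 stays at 0.90 while Macro-F1 collapses to about 0.47, because the completely missed minority class counts equally at the class level, which mirrors the sensitive-query setting where minority classes matter most.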

Compared Methods.
We compare with standard multiclass text classifiers using fine-tuned BERT. Besides, several state-of-the-art (SOTA) hierarchical text classifiers using transfer learning and prompt learning are examined [2,16,25,36,40]. Namely, we compare with (1) HPT [37], where prompt tuning on a pre-trained language model is utilized to handle hierarchical classification from a multi-label masked language model perspective; (2) HGCLR [36], where new queries are generated from the label hierarchy to enhance query representation learning using the contrastive loss; and (3) HiTIN [40], where the label hierarchy is converted into an unweighted tree structure to enhance the query representation.
In implementation, to utilize the unlabeled queries for a fair comparison, we add the widely-used confidence-based sampling strategy to the self-training stage of each compared method. This ensures that all methods are compared in the same semi-supervised setting. For the train/val/test split, we follow the existing splits.
Table 2: Comparison of hierarchical query classification performance. The best algorithm in each row is colored in dark blue and the second best in light blue. Note that we present the baseline result on Amazon as "0" for relative comparison (here, the baseline is BERT), and ± indicates that the corresponding method is above or below the baseline.

This demonstrates the efficacy of our proposed method, especially on the Amazon dataset. In detail, we find that our proposed method beats the baseline fine-tuned BERT by the largest margin, indicating the necessity of designing sophisticated approaches for performance gains (i.e., instance hierarchy, label hierarchy, and neighborhood-aware sampling). We also beat the advanced HPT and HGCLR solutions. The reason may be that we explicitly consider the instance hierarchy to model the relationship between queries, while HPT and HGCLR focus more on the hierarchical label structure. Our neighborhood-aware sampling technique also contributes by selecting high-quality data points for self-training. Even though we are weaker than HiTIN in Micro-F1 on Web of Science and RCV1-V2, we are better in Macro-F1, which is more crucial. This is because in real-world application settings, like sensitive query classification on Amazon, critical categories have fewer data points, and we should treat each category equally rather than each data point equally during evaluation. This is achieved by the Macro-F1 score.

RQ2: Ablation studies.
To examine the contribution of each component in the proposed framework (i.e., label hierarchy, instance hierarchy, and neighborhood-aware sampling in the self-training stage), we first remove one component from the framework. Then, we retrain the model and measure the classification performance.
As shown in Table 3, our proposed method with all components is better than any variant with one component removed. This demonstrates the necessity of each component in the pipeline. Interestingly, we find that removing the label hierarchy leads to the largest performance drop, possibly because the label text provides additional semantic information that the other components do not capture.
Table 3: Ablation studies of our proposed framework. Note that we present our method on the Amazon dataset as "0" for relative comparison, and ± indicates that the corresponding ablated method is above or below our method.
Turning to the hyperparameter analysis in Table 4: (1) for λ_intra-class, the best value is 0.9, and higher values lead to better performance except 1, which indicates that the intra-class hierarchy contributes more than the inter-class hierarchy. The reason can be that the intra-class hierarchy helps the model learn better representations for the downstream query classification;

(2) for λ_contrastive, we find that adding the contrastive loss to the classification loss helps improve the classification performance, but the weight should not be large; 0.1 works best in the current setup. (3) for λ_child, we see that 0.3 is best, implying that when utilizing both child and parent category information in the sampling stage, we should carefully choose and tune the weight. All these results indicate the contribution of each corresponding component. When using them together, we should take caution and exhaustively test different values to find the proper hyperparameter set for the specific application setting.

APPLICATION IN PRACTICE
We launched the proposed model in our sensitive query detection platform on Amazon, which is used in the search module. We compare the proposed method with our previous rule-based production model. We sample queries detected as positive by each model and ask the human labeling team to measure precision before and after the launch. The results show that our method achieves higher precision.

CONCLUSION AND LIMITATION
In this work, we propose a novel hierarchical query classification framework to effectively classify short queries into different groups. In essence, we first utilize label and instance hierarchy patterns to derive the classification and contrastive losses to train the model. We then design a neighborhood-aware sampling method to intelligently utilize unlabeled queries with pseudo labels to boost model performance through self-training. Extensive results on both proprietary and public datasets demonstrate the effectiveness of our model. However, our work still suffers from a few limitations. First, our method is a multi-stage framework rather than an automatic end-to-end solution; manual configuration and monitoring of model training are needed, especially during the self-training stage, to determine the number of high-quality data points for model retraining. Second, in the use case of sensitive query classification, users can purposefully write queries to bypass or attack the classifier such that the classification performance drops [10,12]. We plan to explore this in future work. Third, Large Language Models have gained popularity in recent years due to their promising results across multiple applications, including text classification and text generation. Our method beats ChatGPT in our preliminary examination on the Web of Science dataset. The potential reason is that our task is more challenging than conventional text classification due to the complex label hierarchy structure. But more effort is required for a thorough comparison. Fourth, our method still requires a large number of annotated data points. In real-world applications, especially sensitive query classification on Amazon, there are only a few annotated data points for certain categories. Few-shot learning can be a possible direction for future research.

Figure 1 :
Figure 1: The overview of the proposed framework.
A query is a sequence of words, q_i = (w_1, w_2, ..., w_m), where w_j is the j-th word in the query. We can divide the query set Q into two groups: unlabeled queries Q_U and labeled queries Q_L. For one labeled query, we have its child category c_i and parent category p_i, where p_i denotes the parent category from a set of parent categories P = {P_1, P_2, P_3, ...} and each parent category P_k consists of a set of child categories C_k = {c_1, c_2, c_3, ...}. In this case, we have c_i ∈ C_k. The goal is to leverage the information in both labeled and unlabeled queries to learn a function F(q_i) → (c_i, p_i), where q_i ∈ Q, c_i ∈ C_k, and p_i ∈ P.

Table 1 :
Data statistics of public datasets.
Effectiveness of the proposed framework. As the comparison results in Table 2 show, our proposed method is the best in most cases except Micro-F1 on the Web of Science and RCV1-V2 datasets.

In this section, we examine the effect of hyperparameters on the model performance. Here, we focus on three major parameters: λ_intra-class, λ_contrastive, and λ_child. Each value ranges from 0.01 to 1, and we report the performance in Table 4.

Table 4: Effects of varying weights on the Amazon dataset. Note that we present the result of our method (i.e., λ_intra-class = 0.1, λ_contrastive = 0.01, and λ_child = 0.1) on the Amazon dataset as the baseline, denoted as "0" for relative comparison, and ± indicates that the corresponding configuration is above or below this baseline. ΔMicro-F1 denotes the difference in the Micro-F1 score and ΔMacro-F1 the difference in the Macro-F1 score.