Build Faster with Less: A Journey to Accelerate Sparse Model Building for Semantic Matching in Product Search

The semantic matching problem in product search seeks to retrieve all semantically relevant products given a user query. Recent studies have shown that extreme multi-label classification (XMC) models enjoy both low inference latency and high recall in real-world scenarios. These XMC semantic matching models adopt TF-IDF vectorizers to extract query text features and use mainly sparse matrices for the model weights. However, the limited availability of libraries for efficient parallel sparse modules may lead to tediously long model building times when the problem scales to hundreds of millions of labels. This incurs significant hardware cost and renders the semantic model stale even before it is deployed. In this paper, we investigate and accelerate the model building procedures in a tree-based XMC model. On a real-world semantic matching task with 100M labels, our enhancements achieve over 10 times acceleration (from 3.1 days to 6.7 hours) while reducing hardware cost by 25%.


INTRODUCTION
Matching in product search seeks to retrieve the most relevant products for a user-issued query from a catalog containing billions of items. Lexical matching [29,30] is currently the most widely used retrieval architecture in industry. To efficiently retrieve relevant candidates from a large search space, lexical models such as BM25 [29] use an inverted index to fetch documents containing the same tokens as the query. These methods rely purely on token matching, making them vulnerable to morphological variants and spelling errors. Therefore, many search systems employ semantic matching models in addition to lexical matching. Semantic matching models utilize the semantic meaning of queries and documents and can learn from customers' historical behavior. Embedding-based neural models [23,36] learn query and document embeddings and use their inner product as an indicator of relevance. During inference, approximate nearest neighbor search (ANNS) [12,21] is performed to get the top relevant documents for a given query. In practice, however, considering inference latency and model building cost, shallow neural networks are often used, which sacrifices model performance. In addition to these architectures, a recent study [5] demonstrated the efficiency and effectiveness of extreme multi-label classification (XMC) models for semantic matching. To handle the extremely large output space, tree-based XMC methods [27,39,40] recursively partition the label space to construct hierarchical label trees, enabling fast inference with time complexity logarithmic in the label size. At industry scale, tree-based XMC methods have shown significantly higher recall and low online inference latency compared with dense two-tower models [5]. However, just as bi-encoder models need to re-index to adapt to an updated retrieval space, XMC methods such as XR-Linear [39] need to be constantly refreshed to reflect the latest user behavior and newly added products in the catalog. While bi-encoder training and inference have been well optimized, accelerating model building for XMC is under-studied.
In this work, we develop methods to accelerate large-scale tree-based XMC model building. We focus on three components: query and label vectorization, label tree construction, and matching model learning. An overview of our accelerations is presented in Figure 1.
In particular, we make the following contributions:
• With our efficient implementations of multiple components of XMC model building, we reduce the building time of a 100-million-label semantic matching problem from 3.1 days to 6.7 hours (a 10 times acceleration) with a 25% reduction in hardware cost.
• We demonstrate that our implementations are more efficient than widely used off-the-shelf utilities such as scipy, sklearn, pytorch and IntelMKL. To support a wider range of data mining applications, we also provide standalone, easy-to-use APIs and make all of our implementations open-sourced at https://github.com/amzn/pecos.

RELATED WORKS
Bi-encoder models, or two-tower models, are among the most widely used architectures in semantic matching [22,36]. These models leverage deep neural networks to encode queries and products into a shared feature space and use maximum inner product search to retrieve relevant products for a given query. Bi-encoder models usually need to leverage strong encoders such as ResNet [13] or Transformers [7] to achieve good performance. These models are difficult to deploy in large-scale semantic matching systems because of the significant cost of encoding hundreds of millions of documents as well as the large latency of query encoding. As a result, many real-world production systems, such as Youtube [9], Google Play [37] and e-commerce product search [23,31], adopt shallow multi-layer perceptrons (MLPs) as two-tower encoders.
Extreme Multi-label Classification has received much attention in recent years. On one hand, recent studies have opened the gate to leveraging powerful transformer models for XMC problems [8,16]. Novel architectures have been proposed to address the difficulty of fine-tuning transformers on very large label spaces. XR-Transformer [40] and CascadeXML [19] established state-of-the-art performance on public XMC benchmarks. In industry, partition-based XMC models have been successfully adopted in applications such as dynamic search advertising [27,28]. Recent studies also showed the efficiency of tree-based XMC methods, such as XR-Linear [39], in handling semantic matching problems with search spaces of up to 100 million items. Tree-based semantic matching models [6] have not only shown significant performance gains over shallow bi-encoder methods, but also enjoy much lower inference latency.
In these models, sparse TF-IDF is used instead of DNNs to extract text features. This reduces the model size and improves inference throughput. However, while dense operations are highly optimized via BLAS [2] on CPUs and CUDA [24] on GPUs, existing implementations of most sparse operations are either missing or inefficient. For example, the general matrix-matrix multiplication in most existing libraries only supports parallel computing when at least one of the matrices is dense. Intel MKL [35] supports parallel sparse-sparse matrix multiplication, but it is not open-source. Another example is hierarchical k-means with sparse features. As a half-century-old algorithm, many previous works have studied various methods to accelerate the general k-means algorithm [1,26,34]. However, few works optimize k-means when the features are highly sparse. Also, most libraries that implement hierarchical k-means either do not support sparse features, such as Faiss [17], or cannot scale to large data, such as Scikit-Learn [4].
In practice the number of labels L can scale to 10^7 to 10^8. To efficiently learn and infer on such a large label space, tree-based XMC methods perform label space partitioning to divide and conquer the task. In these methods, the label space is partitioned and represented by a Hierarchical Label Tree (HLT) in which each label corresponds to a leaf node. Matching models (matchers) are then learned to navigate through the tree to find the most relevant leaf nodes. There are many works on tree-based XMC methods [14,18,27,38,40], yet not many have verified their efficacy on industry-scale problems. In the scope of this paper, we consider a representative method, PECOS XR-Linear [39], which previous studies have shown to be effective on industry-level semantic matching problems [5]. The overall model building can be separated into three phases: query and label vectorization, HLT construction and matcher learning. We describe each of these components below.
Query and Label Vectorization: This step constructs the numerical representation for every input query and every label ℓ ∈ Y. In practice, TF-IDF is used as the query vectorizer. XR-Linear adopts positive instance feature aggregation (PIFA) to construct label features. In particular, given the instance feature matrix X, each label ℓ is represented by aggregating the feature vectors of its positive instances: z_ℓ = v_ℓ / ||v_ℓ||, where v_ℓ = Σ_{i: Y_{iℓ}=1} x_i.
Hierarchical Label Tree (HLT) Construction: At this phase, an HLT is constructed over the label space. The label indexing is usually done via hierarchical k-means clustering using the label features constructed in the previous step. At level t of the D-level HLT, the tree nodes constitute a new label space Y^(t) of size K^(t).
Matcher Learning: At this phase, we learn the models that navigate through the HLT. At each level t ∈ [D] of the hierarchical label tree, a one-versus-all (OVA) classifier f^(t)(x, ℓ), namely the matcher, is trained to score the relevance of any cluster ℓ at level t given input x. In particular, the relevance score is computed as f^(t)(x, ℓ) = w_ℓ^(t)⊤ x, where W^(t) ∈ R^{d × K^(t)} is the model weight matrix. The total learnable model weights are denoted as W = {W^(1), ..., W^(D)}.
In Figure 1, we present the time breakdown of the four important components with and without our acceleration. In the next sections, we discuss each component individually and present our solution to accelerate it.
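The PIFA aggregation described above can be sketched with scipy sparse operations; this is a minimal illustration (the `pifa` helper name is hypothetical), not the PECOS implementation:

```python
import numpy as np
import scipy.sparse as sp

def pifa(X, Y):
    """PIFA label embeddings: sum the feature vectors of each label's
    positive instances (Z_raw = Y^T X), then L2-normalize each row."""
    Z = sp.csr_matrix(Y).T.dot(sp.csr_matrix(X))   # (L x d) aggregated features
    norms = np.asarray(np.sqrt(Z.multiply(Z).sum(axis=1))).ravel()
    norms[norms == 0] = 1.0                        # keep empty labels at zero
    return sp.diags(1.0 / norms).dot(Z).tocsr()
```

With sparse X and Y, the whole construction reduces to one sparse-sparse matrix product, which is exactly the SpGEMM operation optimized later in the paper.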

FAST TF-IDF VECTORIZATION
As TF-IDF is a commonly used text feature extractor, off-the-shelf modules are available in several open-sourced packages, e.g., Scikit-Learn. However, these implementations cannot be directly used in production systems because of efficiency issues. Firstly, industry-level product search systems have hundreds of millions of queries, and it can cost several hours to construct TF-IDF features for the training queries. Secondly, online product search must respond within milliseconds, so it is unacceptable for query vectorization alone to take longer than that. Therefore, we implemented an efficient TF-IDF vectorizer to satisfy these requirements.
Given a training corpus, TF-IDF building requires going over the corpus twice: a first pass to build the vocabulary, and a second pass to count term and document frequencies.
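A minimal sketch of this two-pass procedure (the function names are hypothetical, the smoothed IDF formula follows the common sklearn-style convention, and whitespace splitting stands in for the real tokenizers):

```python
import math
from collections import Counter

def fit_tfidf(corpus):
    """Two-pass TF-IDF fitting: pass 1 builds the vocabulary,
    pass 2 counts document frequencies for the IDF weights."""
    vocab = {}
    for doc in corpus:                      # pass 1: vocabulary
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab))
    df = Counter()
    for doc in corpus:                      # pass 2: document frequency
        df.update(set(doc.split()))
    n = len(corpus)
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Map a document to a sparse {column: tf * idf} representation."""
    tf = Counter(tok for tok in doc.split() if tok in vocab)
    return {vocab[t]: c * idf[t] for t, c in tf.items()}
```

Both passes iterate over documents independently, which is what makes the document-wise parallelization and streaming described below possible.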
Our C++ implementation adopts document-wise parallel computing for both passes. It also allows hybrid tokenization, such as combining character tri-grams with word bi-grams, where each tokenizer and TF-IDF vectorizer is learned independently. For large datasets, loading the corpus into memory would introduce large time and memory overhead. To handle such cases, we enable out-of-core TF-IDF training, where the corpus file is processed in a parallel streaming fashion. In Section 8.1, we present the acceleration of our TF-IDF implementation over popular open-sourced implementations, where our method reduces vectorization time by over 40 times on large datasets.

PARALLEL SPARSE-SPARSE MATRIX MULTIPLICATION
Constructing PIFA label features requires a general sparse-sparse matrix multiplication (SpGEMM) C = AB, which can be carried out row-wise or column-wise. The sparsity condition ensures that the column-wise computation has the lower time complexity of the two.
Sparse Accumulator. The crucial part of column-wise SpGEMM is accumulating weighted columns of A onto the sparse output column C_{:,j}, a.k.a. the Sparse Accumulator (SPA) [11]. This is not trivial, as the sparsity patterns of the columns of A are irregular. To support constant-time insertion and a gather whose cost is linear in the number of non-zeros, we implement an abstract data type SPA that consists of a dense value vector w, a non-zero indicator array, and a list of non-zero indices.
Parallel SpGEMM. Extending Algorithm 1 to the parallel setting is not trivial. Naively splitting the columns of B across workers can double the memory consumption of the output matrix C, because each worker's local copy of its sub-matrix of C must be copied into the final aggregated result. To avoid this, we introduce a novel trick that estimates an upper bound on the number of non-zeros in C and pre-allocates its memory. This resolves the double-memory issue and saves the time spent copying results.
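A single-threaded Python sketch of column-wise SpGEMM with a sparse accumulator may make the mechanics concrete; this is an illustration of the SPA idea, not the parallel C++ implementation, and `spgemm_columnwise` is a hypothetical name:

```python
import numpy as np
import scipy.sparse as sp

def spgemm_columnwise(A: sp.csc_matrix, B: sp.csc_matrix) -> sp.csc_matrix:
    """Column-wise SpGEMM C = A @ B using a sparse accumulator (SPA)."""
    m = A.shape[0]
    values = np.zeros(m)            # dense value vector w
    occupied = np.zeros(m, bool)    # non-zero indicator
    cols_indptr = [0]
    out_idx, out_val = [], []
    for j in range(B.shape[1]):
        nz_list = []                # list of indices touched in this column
        # scatter: accumulate A[:, k] * B[k, j] for every non-zero B[k, j]
        for p in range(B.indptr[j], B.indptr[j + 1]):
            k, bkj = B.indices[p], B.data[p]
            for q in range(A.indptr[k], A.indptr[k + 1]):
                i = A.indices[q]
                if not occupied[i]:
                    occupied[i] = True
                    nz_list.append(i)
                values[i] += A.data[q] * bkj
        # gather: emit C[:, j] and reset the accumulator touching only nz entries
        for i in nz_list:
            out_idx.append(i)
            out_val.append(values[i])
            values[i] = 0.0
            occupied[i] = False
        cols_indptr.append(len(out_idx))
    return sp.csc_matrix((out_val, out_idx, cols_indptr),
                         shape=(m, B.shape[1]))
```

Note that insertion and reset touch only the entries that are actually non-zero, so each output column costs time proportional to the work it requires rather than to the full row dimension m.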
In Section 8.1 we compare our implementation with state-of-the-art linear algebra libraries that implement the SpGEMM operation under both single-threaded and multi-threaded settings. Our proposed PECOS-SpGEMM achieves the highest speedup.

LABEL SEMANTIC INDEXING
Upon constructing the label representations Z ∈ R^{L×d}, a hierarchical label tree (HLT) of depth D is constructed such that semantically similar labels are placed closer together than irrelevant labels. The HLT is constructed via top-down B-ary k-means clustering. At level t ∈ {1, 2, ..., D−1}, there are K^(t) = B^t clusters, and at the bottom level D the number of leaf nodes is K^(D) = L. The resulting HLT can be represented by a series of indexing matrices {C^(t)}_{t=1}^{D}, where C^(t) ∈ {0,1}^{K^(t) × K^(t−1)} is the adjacency matrix between two consecutive tree levels. At level t, the time complexity of one iteration of hierarchical k-means is given by (2), where the first term, O(K^(t) d), is the center initialization complexity, the second term is the cost of the cluster center update, and the third term is the cost of sorting node distances.
Sparse Accumulator for Cluster Center Update. One of the most time-consuming parts of the k-means algorithm is updating the cluster centers, which involves accumulating the node embeddings within each individual cluster. For clustering with sparse features this is tricky, as the final sparsity pattern is not known until all cluster members have been accumulated. Most existing implementations avoid this by using dense centers even for sparse k-means. Although the accumulation can then be done only on the non-zero dimensions of the node embeddings, the center initialization has complexity O(d) for each cluster, and this overhead is not negligible when K^(t) and d are large. To address this issue, we leverage the sparse accumulator (SPA) described in Section 5 to achieve fast cluster center initialization.
As illustrated in Figure 2, using SPA centers results in significant acceleration when the cluster centers are sparse, but introduces overhead when the centers are close to dense, especially at the top levels of the HLT. In practice we switch between SPA centers and dense centers using an estimate of the cluster center sparsity ratio, obtained by assuming no overlap amongst the features of the nodes in each cluster. When the estimated ratio exceeds a threshold we use the dense accumulator for the center update, and use SPA otherwise; the associated hyper-parameters are often chosen as 5% and 10%. This reduces the first term of (2) to O(K^(t) p̄), where p̄ is the average number of non-zeros in the cluster centers.
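The no-overlap density estimate and the switching rule can be sketched as follows; the function names, the exact estimator nnz(Z)/(K·d), and the 10% default threshold are illustrative assumptions rather than the exact PECOS heuristic:

```python
import scipy.sparse as sp

def estimate_center_density(Z: sp.csr_matrix, n_clusters: int) -> float:
    """Upper-bound estimate of average cluster-center density, assuming no
    overlap among member features: every non-zero of Z lands in exactly one
    center, so the average center holds at most nnz(Z) / n_clusters entries."""
    n_labels, dim = Z.shape
    return min(1.0, Z.nnz / (n_clusters * dim))

def use_dense_center(Z, n_clusters, threshold=0.10):
    # hypothetical switching rule: dense accumulator when centers are
    # expected to be dense, SPA otherwise
    return estimate_center_density(Z, n_clusters) > threshold
```

Since K^(t) grows by a factor of B per level, the estimated density shrinks rapidly down the tree, which is why the switch typically flips from dense to SPA centers after the top few levels.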
Node Sub-sampling Scheme. Researchers have studied accelerating large-scale k-means clustering by bootstrapping on smaller sub-sampled data [3,32]. In practice, we also observe that at the top levels of the HLT, when there are millions of nodes in each cluster, clustering with a uniformly sampled subset of the labels results in clusters similar to those obtained with the full data. Therefore, for large-scale data we adopt an adaptive sampling scheme across the levels of the HLT clustering. In particular, we use a sampling rate that increases linearly with the level, controlled by a hyper-parameter γ ∈ R+.
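A minimal sketch of level-dependent uniform sub-sampling; the concrete rate formula min(1, γ·t) is an assumption standing in for the paper's linear schedule, and `sample_nodes` is a hypothetical helper:

```python
import numpy as np

def sample_nodes(n_nodes: int, level: int, gamma: float,
                 rng: np.random.Generator) -> np.ndarray:
    """Uniformly sub-sample node indices for clustering at a given HLT level,
    with a sampling rate that grows linearly with the level
    (hypothetical rate = min(1, gamma * level))."""
    rate = min(1.0, gamma * level)
    size = max(1, int(rate * n_nodes))
    return rng.choice(n_nodes, size=size, replace=False)
```

The schedule samples aggressively at the top levels, where clusters contain millions of nodes and coarse partitions are robust to sub-sampling, and approaches the full data near the leaves, where clusters are small and cheap to process exactly.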
With SPA centers and node sub-sampling, the per-iteration time complexity of k-means at level t is reduced accordingly. On a semantic matching task with 100 million labels, we are able to reduce the clustering time from 8.7 hrs to 0.6 hrs (a 14.5× acceleration) with the same hardware. More empirical comparisons are presented in Section 8.1.

DISTRIBUTED MATCHER LEARNING
To retrieve relevant labels from an extremely large space, XR-Linear learns matcher models to navigate through the tree. At level t of the HLT, a scoring function f^(t)(x, ℓ) = w_ℓ^(t)⊤ x is learned for every ℓ ∈ [K^(t)], where W^(t) ∈ R^{d × K^(t)} is the model weight at level t. The positive nodes at level t are induced recursively by Y^(t−1) = binarize(Y^(t) C^(t)), where Y^(D) = Y is the original label matrix. The query-label pairs considered in training f^(t) are given by the matching matrix M^(t) = binarize(Y^(t−1) C^(t)⊤), and the objective function at level t can be written as:
min_{W^(t)} Σ_{(i,ℓ): M^(t)_{iℓ} ≠ 0} L(Y^(t)_{iℓ}, w_ℓ^(t)⊤ x_i),
where L is a point-wise loss such as hinge loss, squared hinge loss or BCE loss.
Model Separability. Because Y^(t) and M^(t) are fully determined for a given HLT, objective (5) decomposes column-wise into K^(t) independent sub-problems:
min_{w_ℓ^(t)} Σ_{i: M^(t)_{iℓ} ≠ 0} L(Y^(t)_{iℓ}, w_ℓ^(t)⊤ x_i), ∀ℓ ∈ [K^(t)].
This model separability can be exploited for single-machine parallel solving [10], where each column w_ℓ^(t) is optimized independently. We further exploit this property and design a distributed solving scheme by leveraging the structure of the hierarchical label tree: since all the non-zero entries of M^(t) lie within the same cluster, we can separate the hierarchical XMC problem into independent sub-problems via tree structure splitting.
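The column-wise decomposition can be sketched as a loop over labels, each fitted on only the rows selected by the matching matrix M; this is an illustrative sequential sketch (the helper name `train_matcher_columns` and the pluggable `fit_binary` solver are assumptions), whereas in practice the independent columns are dispatched to parallel workers:

```python
import numpy as np
import scipy.sparse as sp

def train_matcher_columns(X, Y, M, fit_binary):
    """Train one binary classifier per label column independently
    (model separability). `fit_binary(X_sub, y_sub)` is any solver
    returning a weight vector; only rows with M[:, l] != 0 are used."""
    M = sp.csc_matrix(M)
    Y = sp.csc_matrix(Y)
    W = np.zeros((X.shape[1], Y.shape[1]))
    for l in range(Y.shape[1]):
        # rows participating in sub-problem l, from the matching matrix
        rows = M.indices[M.indptr[l]:M.indptr[l + 1]]
        y = np.asarray(Y[rows, l].todense()).ravel()
        W[:, l] = fit_binary(X[rows], y)
    return W
```

Because no state is shared between iterations, the loop body is exactly the unit of work that can be distributed across machines.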
Meta and Sub-tree Training. The overall design of distributed XR-Linear follows a divide-and-conquer scheme: a pre-constructed HLT is separated into a meta tree (levels 1 through t) and S = B^t sub-trees. On each of the meta or sub-trees, an XR-Linear model can be trained independently. These S + 1 XR-Linear models are then assembled to reconstruct the XR-Linear solution of the original XMC problem. Using the same idea, the construction of the HLT can also be separated into S + 1 tasks to achieve distributed computing.
Load Balancing. Most XMC problems have a "long-tailed" label distribution [8], which leads to large variance in the training load of each sub-problem. To address this issue, in practice we choose the number of sub-problems S to be larger than the number of workers and perform load balancing across workers. In particular, the workload of a sub-tree is estimated by its total number of positive and negative training samples, and we adopt the Longest-Processing-Time-first (LPT) algorithm to greedily assign the heaviest job to the lightest worker until all jobs are assigned.
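The LPT rule described above can be sketched in a few lines with a min-heap of worker loads (function name hypothetical):

```python
import heapq

def lpt_assign(job_loads, n_workers):
    """Longest-Processing-Time-first: sort jobs by load (descending),
    then greedily give each job to the currently lightest worker."""
    heap = [(0.0, w) for w in range(n_workers)]   # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for job, load in sorted(enumerate(job_loads),
                            key=lambda kv: kv[1], reverse=True):
        worker_load, w = heapq.heappop(heap)      # lightest worker so far
        assignment[w].append(job)
        heapq.heappush(heap, (worker_load + load, w))
    return assignment
```

LPT is a classic greedy heuristic with a 4/3 worst-case approximation ratio for makespan, which is more than adequate here since the sub-tree loads are themselves only estimates.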

EMPIRICAL RESULTS
All experiments are conducted on x1.16xlarge instances. In Section 8.1, we compare the component-wise time cost between our implementation and widely used open-sourced implementations. In Section 8.2, we present the end-to-end model building time before and after the acceleration of each component.
Datasets. We conduct experiments on the two largest public XMC benchmark datasets (Wiki-500K and Amazon-3M) and industry-level semantic matching datasets. Following the procedures in [6,15,20], we collect the semantic matching datasets from one of the largest e-commerce product search engines. In particular, we use 12 months of search logs as the training set and the trailing 1 month of search logs as the evaluation set. The resulting XMC training data (SM-large) has around 100 million labels. We construct SM-small, an intermediate-sized benchmark between Amazon-3M and SM-large, via uniform sampling. Detailed data statistics are presented in Table 1.

Component Wise Comparing
Text Vectorization. We compare the training and prediction time of the TF-IDF vectorizer on all datasets in Table 1, with results recorded in Table 2. We compare our implementation with the baseline from Scikit-Learn [4] (Sklearn). For Wiki-500K and Amazon-3M, we use a word tokenizer and set the maximum number of features as in [38]. For SM-small and SM-large, we follow the same setting as [6] and use a combination of word unigrams, word bigrams and character trigrams to build the TF-IDF features, with a total of 4.2 million dimensions. On the large datasets SM-small and SM-large, our TF-IDF vectorizer is 30 times faster in training and over 90 times faster in prediction.
SpGEMM. We compare the proposed SPA-SpGEMM with state-of-the-art linear algebra libraries that implement the SpGEMM operation, including SciPy [33], IntelMKL [35] and Pytorch [25]. For all datasets we compute the matrix product between Y⊤ ∈ R^{L×N} and the feature matrix X ∈ R^{N×d}, where both matrices are in CSR format, except for Pytorch, where the matrices are converted to COO format before multiplication, as its CSR implementation is not able to scale to SM-small and SM-large. Given that we already include its backend MKL in our comparison, we omit Pytorch CSR from our results. Run times with 1 to 32 threads are presented in Figure 3. Under both single-threaded and multi-threaded settings, our SPA-SpGEMM implementation outperforms the other linear algebra libraries.
Hierarchical Clustering. Most available implementations of hierarchical clustering either do not support sparse features, e.g., FAISS [17], or rely on the condensed distance matrix, which is not able to scale, e.g., Scipy [33]. Therefore, we compare our implementation of hierarchical k-means with the strong baseline provided in Scikit-Learn [4]. Both Sklearn and our method leverage multi-CPU parallelism.

End2end acceleration
To understand the contribution of each enhancement, Table 4 lists a detailed breakdown of each step in model building, adding our enhancements one at a time. The monetary cost of model building is computed from the hourly rate of an x1.16xlarge instance ($6.67). Distributed training gives a larger acceleration but is only financially desirable for large datasets like SM-small and SM-large: with 8 x1.16xlarge instances, we achieve a 10 times acceleration with a 26.7% reduction in hardware cost.
In production, one needs to find a reasonable trade-off between time and hardware cost for the most efficient setting.

CONCLUSIONS
In this work, we present an end-to-end acceleration of XMC model building at industry-scale data. Our method achieves over 10 times acceleration while reducing hardware cost by 25% on industry-level semantic matching tasks. Our enhanced modules, including the TF-IDF vectorizer, SpGEMM, hierarchical k-means and the distributed training setup/infrastructure, are general enough for a broader spectrum of data mining applications. We have made everything open-sourced at https://github.com/amzn/pecos.

Figure 1 :
Figure 1: Comparison of model building time with and without our acceleration. From the small benchmark data with 500K labels to the real-world semantic matching task SM-large with more than 100 million labels, we achieve 10× to 20× acceleration over the baselines.

Figure 2 :
Figure 2: Layer-wise clustering time along the HLT with dense and sparse cluster centers. As the number of clusters grows, the cluster centers become sparse and dense center initialization becomes the dominant cost.

Table 1 :
Data statistics. N: the number of queries. L: the number of labels. L̄: the average number of positive labels per instance. n̄: the average number of instances per label. d: the sparse feature dimension of Φ(·).

Table 2 :
TF-IDF vectorizer training and prediction time reported in minutes. Green text in parentheses highlights the acceleration of our method over the corresponding Sklearn baseline.

Table 3 :
Hierarchical clustering time comparison in minutes. Green text highlights the acceleration over the Sklearn baselines, and blue text highlights the acceleration over our implementation without SPA and sampling.

Table 4 :
Detailed results with component wise breakdown.