A Transformer-Based Substitute Recommendation Model Incorporating Weakly Supervised Customer Behavior Data

Substitute-based recommendation is widely used in E-commerce to offer customers better alternatives. Existing research typically uses customer behavior signals such as co-view and view-but-purchase-another to capture the substitute relationship. Although intuitively sound, this approach can ignore the functionality and characteristics of products. In this paper, we cast substitute recommendation as a language matching problem: the model takes the product title as input so that it can account for product functionality. We design a new transformation method to de-noise the signals derived from production data. In addition, we consider multilingual support from an engineering point of view. Our proposed end-to-end transformer-based model succeeds in both offline and online experiments. It has been deployed on a large-scale E-commerce website for 11 marketplaces in 6 languages, and it increased revenue by 19% in an online A/B experiment.


I. INTRODUCTION
Substitute-based recommendations are widely adopted in E-commerce to give customers more options, especially when the reference product is out of stock, higher-priced, or lower-rated [1], [2]. They improve the overall shopping experience and increase customer affinity and loyalty to the E-commerce service provider.
Substitute recommendation aims to provide alternative products for a given reference product, which can be regarded as <reference product, alternative product> pairs. When learning such pairs, existing research usually uses customer behavior signals to extract the substitute relationship [3]. Two commonly adopted substitute definitions are co-view and view-but-purchase-another [4]. We call these behavior-based heuristics buyability signals in this paper, since they are correlated with customer purchase behavior.
However, such an approach does not consider product functionality. As shown in Figure 1, although vitamin D and vitamin C are often paired by co-view, they are not substitutable in functionality or product characteristics. Such bad cases not only harm customer trust but may also raise legal and regulatory issues when the products are claimed to be substitutes for each other. Therefore, we define substitute recommendations based on both buyability and functionality in this paper and account for product functionality in the proposed model. Ideally, such functionality relationships would be learned from human-annotated labels, but in practice these labels are highly time-consuming and expensive to acquire. Hence, we still use the signals from production (i.e., impressions, clicks, purchases, and revenue) to train our models, with the awareness of their weakly supervised nature. The corresponding technical challenges are as follows: 1) Inaccurate supervision: Customer behavior might be confounded by factors other than functional substitutability, such as complementary products and customer preference. 2) Selection bias: It occurs when data samples are not representative of the underlying data distribution [5]. 3) Domain gap in the E-commerce corpus.
To address the above challenges, we first build a dedicated classification dataset to incorporate functionality in evaluation. We use negative sampling augmentation to address selection bias and improve model robustness. We adopt a regression setting with log transformation to de-noise the weakly supervised traffic signals and to consider product functionality and buyability together. We take the product title as the model input, from which the model can learn more domain and contextual background. Thus, we convert our substitute recommendation problem into a natural language matching problem, in which the reference product is regarded as the "query" and the alternative product as the "answer" or "document". Another advantage of using a deep language model is multilingual support. Compared to training a separate model for every language (or marketplace), a single multilingual model greatly reduces development and maintenance effort. Specifically, we adopt XLMR [6], [7] for our use case.
To summarize, our contributions are as follows: 1) To the best of our knowledge, this is the first work to consider product functionality in substitute recommendation. 2) We employ a state-of-the-art transformer-based model to learn textual information from the product title and fine-tune it on our E-commerce-specific domain. 3) We design new transformation methods and loss objectives to de-noise label signals from production data, and we adopt a corresponding negative sampling strategy to improve robustness. 4) Practically, we consider multilingual support from an engineering point of view. The proposed model is deployed in production and demonstrates success in both online and offline experiments.

II. RELATED WORK
Our work draws on three main streams of literature.
a) Substitute recommendation: Substitute recommendation is widely adopted in E-commerce and has been studied in recommendation research. [4] released the Amazon product catalog dataset, which defines a substitute as (1) users who viewed v also viewed v', or (2) users who viewed v eventually bought v'. As mentioned in the introduction, although significant efforts [4], [8], [9] have been made to improve the fit to this behavior-based metric, little attention has been paid to the functionality and characteristics of products.
b) Weakly supervised learning: Weakly supervised learning trains models without perfect supervision [10]. Data imperfection in the implicit customer feedback of recommendation systems is a well-studied problem [11], [12]. Specifically, we focus on selection bias [5] and noisy labels in this paper. Negative sampling is a common technique to mitigate the first issue, especially in the context of sampling from implicit negative feedback [13], [14]. [15] addresses the noisy customer label problem by setting the temperature of the softmax function to control the level of confidence.
c) Multilingual language understanding: BERT and its variations [6], [7], [16] are state-of-the-art models in many natural language processing tasks. In recommendation systems and information retrieval, there are two main paradigms of BERT models: representation-based and interaction-based models. The representation-based approach follows the two-tower architecture and is widely used in the retrieval stage because of its scalability [17]. The interaction-based approach [18] leverages the interaction between pairs and typically requires more computation power. Orthogonal to model architecture selection, fine-tuning a multilingual model in an industry setting is less studied than fine-tuning a monolingual model. [7] argues that scaling a single model to more languages causes dilution and consequently leads to relative underperformance on monolingual tasks. However, we found that a multilingual model can outperform the monolingual model because it exploits additional supervision from other languages.

III. PROBLEM FORMULATION
The main input to our substitute model is a pair of reference product and alternative product titles. The model is tasked with predicting the customer feedback that is correlated with functionality and buyability. Specifically, let u and v be the product titles of the reference product and the alternative product, and let y be the label extracted from the raw customer signal (the count of received impressions, clicks, purchases, etc.). Assuming D is the set of n pairs of reference and substitute products, Div is the loss function measuring the divergence between the model output and the ground truth label, f is the model that outputs the substitute score, and θ denotes the learnable model parameters, we can define the learning objective as:

$$\min_{\theta} \frac{1}{n} \sum_{(u, v, y) \in D} \mathrm{Div}\big(f(u, v; \theta),\, y\big)$$

In this paper, we study different selections of y, including click-through rate (CTR), conversion rate (CVR), purchase rate (PR, equal to CTR × CVR), and gross merchandise value (GMV).
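The learning objective described above can be sketched as an average divergence over the pairs in D. This is a minimal illustration under stated assumptions, not the training loop used in this work: `empirical_risk`, `squared_error`, and `toy_model` are hypothetical names, squared error is only one possible choice of Div, and the token-overlap scorer merely stands in for f.

```python
from typing import Callable, Sequence, Tuple

# (reference title u, alternative title v, label y)
Pair = Tuple[str, str, float]


def squared_error(pred: float, target: float) -> float:
    """One possible Div (assumption): squared error for a regression label."""
    return (pred - target) ** 2


def empirical_risk(
    data: Sequence[Pair],
    f: Callable[[str, str], float],
    div: Callable[[float, float], float] = squared_error,
) -> float:
    """Average divergence between model scores f(u, v) and labels y over D."""
    if not data:
        raise ValueError("dataset D must contain at least one pair")
    return sum(div(f(u, v), y) for u, v, y in data) / len(data)


def toy_model(u: str, v: str) -> float:
    """Hypothetical stand-in for f: Jaccard overlap of title tokens."""
    a, b = set(u.lower().split()), set(v.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    D = [
        ("vitamin d 1000 iu", "vitamin d 2000 iu", 0.5),
        ("vitamin d 1000 iu", "vitamin c 500 mg", 0.0),
    ]
    print(empirical_risk(D, toy_model))
```

In training, θ would be chosen to minimize this quantity; the sketch only evaluates it for a fixed scorer.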

IV. METHODOLOGY
In this section, we first discuss feature selection to mitigate the weak supervision issue. Then, we demonstrate how we find the best label as the proxy for buyability and substitutability and use negative sample augmentation to address the selection bias problem. Afterward, we present the model design for multilingual understanding with domain adaptation.

A. Feature selection
We use the title as the model feature in order to avoid overfitting to the noise from customer preference. Low-coverage features such as size and color are dropped, since most of their information is already covered in the title. When the mappings are ingested from various retrieval sources, they often come with confidence values measuring the quality of the mappings. We also drop these values and the source information to avoid the cold start problem and a dependency on the upstream model. Otherwise, the model would need to be re-calibrated every time the upstream modules are updated. Note that we used price information in an early iteration because it improved the offline metric. However, we found that the trained model filtered out more high-priced products even when they were substitutable, because customer engagement is worse on average for expensive products. For example, products cheaper than $20 have three times higher purchase rates than products priced above $100, driven by higher conversion rates. As a short-term solution, we dropped the price features and left them for future investigation.

B. Label engineering
The goal of label engineering is to find the best proxy labels correlated with functionality and buyability. For functionality, purchases are stronger signals because customers need to pay, while clicks can occur on a non-substitutable product out of the customer's curiosity. Hence, we use the purchase rate (PR), the ratio of the number of purchases to the number of impressions, as the training label. In addition, we found that CTR and CVR have a Pearson correlation lower than 0.20, which suggests that using either one alone leads to a suboptimal buyability ranking. Another commonly adopted approach is to view the problem as binary classification. Following [19], we can define positive samples as the recommendations purchased by customers at least once, and negative samples as those that were never purchased.
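The candidate labels discussed above can be derived from aggregated counts, as in the following sketch. The text defines only PR (purchases over impressions) and the binary label; the CTR and CVR formulas below are the usual definitions and are an assumption here, as are the `MappingStats` type and all function names.

```python
from dataclasses import dataclass


@dataclass
class MappingStats:
    """Aggregated traffic counts for one <reference, alternative> mapping."""
    impressions: int
    clicks: int
    purchases: int


def ctr(s: MappingStats) -> float:
    """Click-through rate: clicks per impression (assumed definition)."""
    return s.clicks / s.impressions if s.impressions else 0.0


def cvr(s: MappingStats) -> float:
    """Conversion rate: purchases per click (assumed definition)."""
    return s.purchases / s.clicks if s.clicks else 0.0


def purchase_rate(s: MappingStats) -> float:
    """Purchase rate: purchases per impression, i.e. CTR x CVR."""
    return s.purchases / s.impressions if s.impressions else 0.0


def binary_label(s: MappingStats) -> int:
    """Classification label: 1 if the mapping was purchased at least once."""
    return 1 if s.purchases >= 1 else 0


if __name__ == "__main__":
    stats = MappingStats(impressions=1000, clicks=50, purchases=5)
    print(ctr(stats), cvr(stats), purchase_rate(stats), binary_label(stats))
```

The binary label collapses every purchased mapping to the same value, whereas PR keeps the information needed to rank high-performing pairs above marginal ones.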
We also observe a long-tail distribution of the purchase rate, as shown in Figure 2a. An extremely high purchase rate is likely to be noisy due to insufficient impressions or data collection errors. Hence, we log-transform the label and find that the resulting distribution is smoother, as shown in Figure 2b and Figure 2c, while the relative order of the labels is maintained.
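The effect of the log transformation can be illustrated with a short sketch. The exact transform is not specified in the text, so the form log(PR + eps) and the value of `eps` are assumptions; the offset is only there to keep a zero purchase rate finite.

```python
import math
from typing import List


def log_transform(pr: float, eps: float = 1e-4) -> float:
    """Map a purchase rate to log space; eps (assumed) handles PR = 0."""
    if pr < 0:
        raise ValueError("purchase rate must be non-negative")
    return math.log(pr + eps)


def transform_labels(rates: List[float], eps: float = 1e-4) -> List[float]:
    """Apply the transform to every label in a list."""
    return [log_transform(r, eps) for r in rates]


if __name__ == "__main__":
    rates = [0.0, 0.001, 0.01, 0.3]
    for r, t in zip(rates, transform_labels(rates)):
        print(f"PR={r:<6} -> {t:.3f}")
```

Because the logarithm is strictly increasing, the ranking of mappings by label is unchanged, while extreme purchase rates are pulled closer to the bulk of the distribution.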

C. Negative sample augmentation
To mitigate the selection bias, we randomly sample pairs of products as negative training samples. Given the wide spectrum of our products, the chance that a random mapping is relevant is negligible. In the regression setting, we need to assign a numerical "purchase rate" to the random mappings. Since a random mapping is expected to have lower quality than serving data, we assign a negative value to random negative samples. We perform random sampling once before training instead of separately for each batch, which is computationally efficient and has similar performance, as mentioned in [15], [20]. In Figure 3, we visualize the score distribution with and without negative sampling. A well-trained model is expected to separate random samples from positive samples easily. With the correct setting of the negative value and sampling ratio, the model becomes more robust to random mappings. Note that aside from random negative samples, around 60% of our training data has a zero purchase rate and serves as hard negative samples.
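The augmentation step described above can be sketched as follows. This is an illustration under assumptions: the function name, the `ratio` and `negative_label` parameters, and the value -1.0 used in the demo are hypothetical, since the text states only that random pairs receive a negative value and that sampling happens once before training.

```python
import random
from typing import List, Sequence, Tuple

# (reference title, alternative title, label)
Pair = Tuple[str, str, float]


def sample_random_negatives(
    positives: Sequence[Pair],
    catalog: Sequence[str],
    ratio: float,
    negative_label: float,
    seed: int = 0,
) -> List[Pair]:
    """Draw random product pairs once, before training, as negatives."""
    items = list(catalog)
    if len(items) < 2:
        raise ValueError("catalog needs at least two products")
    rng = random.Random(seed)
    observed = {(u, v) for u, v, _ in positives}
    target = int(len(positives) * ratio)
    negatives: List[Pair] = []
    chosen = set()
    attempts = 0
    # Bounded retries so a tiny catalog cannot loop forever.
    while len(negatives) < target and attempts < 100 * max(target, 1):
        attempts += 1
        u, v = rng.sample(items, 2)  # two distinct catalog positions
        if (u, v) in observed or (u, v) in chosen:
            continue  # skip pairs already in serving data or already drawn
        chosen.add((u, v))
        negatives.append((u, v, negative_label))
    return negatives


if __name__ == "__main__":
    catalog = ["vitamin d", "vitamin c", "usb cable", "yoga mat", "desk lamp"]
    serving = [("vitamin d", "vitamin c", 0.0), ("usb cable", "desk lamp", 0.01)]
    train = serving + sample_random_negatives(serving, catalog, 1.5, -1.0)
    print(len(train))
```

Both the sampling ratio and the negative label value are hyperparameters that would need tuning against the functionality classification set.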

D. Models
Our main proposed model is a transformer-based deep learning model. Recent years have witnessed significant progress in adopting deep learning for better model performance [21], [22]. However, given the popularity of the gradient boosting decision tree (GBDT) model in industry [23], [24], we build a GBDT model first. GBDT has light computation requirements and is easier to deploy in production than deep learning models. In this section, we give a high-level overview of these two models and of how we adapt them to our use case.
TABLE I: Model performance with different objectives. Columns: Objective, ∆AuPRC, ∆NDCG@CTR, ∆NDCG@CVR, ∆NDCG@PR. The baseline model is underlined and its score is marked by a dash.

1) GBDT model: To adapt GBDT to our use case, we first featurize the text into a fixed-length vector. We use word embeddings to map each word into a low-dimensional vector, which has proved effective in many areas [25], [26]. Specifically, we first remove the stop words using the NLTK library and encode the remaining words with the FastText word embedding [26]. Then, we sum the word embeddings to get a fixed-size embedding. Our early experiments show that the sum performs better than the average in our case. Lastly, the embeddings of both products are concatenated and fed into the GBDT model for learning.
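The featurization steps above can be sketched with a toy embedding table. The table, the short stop-word set, and the function names are hypothetical stand-ins for the FastText vectors and the NLTK stop-word list; unlike FastText, this toy version simply skips out-of-vocabulary tokens.

```python
from typing import Dict, List

# Stand-in for the NLTK English stop-word list (assumption).
STOP_WORDS = {"a", "an", "the", "for", "with", "and", "of"}


def embed_title(title: str, table: Dict[str, List[float]], dim: int) -> List[float]:
    """Sum the embeddings of the non-stop-word tokens in a title."""
    vec = [0.0] * dim
    for token in title.lower().split():
        if token in STOP_WORDS:
            continue
        emb = table.get(token)
        if emb is None:
            continue  # toy table only: unknown tokens are skipped
        for i, x in enumerate(emb):
            vec[i] += x
    return vec


def featurize_pair(
    u: str, v: str, table: Dict[str, List[float]], dim: int
) -> List[float]:
    """Concatenate the summed embeddings of both titles (length 2 * dim)."""
    return embed_title(u, table, dim) + embed_title(v, table, dim)


if __name__ == "__main__":
    toy = {"vitamin": [1.0, 0.0], "d": [0.0, 1.0], "c": [0.5, 0.5]}
    print(featurize_pair("Vitamin D for adults", "the vitamin C", toy, 2))
```

The resulting fixed-length vector is what a GBDT learner would consume. Note that summing discards word order, which is the limitation the deep learning model addresses.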
2) Deep learning model: One disadvantage of the GBDT model is that it completely disregards the word order in the sentence. In addition, since the word embedding is pretrained on a public corpus and cannot be fine-tuned in an end-to-end manner [27], it is difficult to adapt to our E-commerce domain-specific data [28]. To solve this issue, we use a transformer-based neural network [6] and fine-tune it on our dataset for domain adaptation. We adopt an interaction-based model in our case, which achieved higher accuracy than the representation-based model in our preliminary experiments. Specifically, we use XLMR [7] as the model backbone, which achieves state-of-the-art performance on cross-lingual benchmarks and remains competitive with monolingual models on GLUE [29].

V. EXPERIMENT
In this section, we first describe the dataset and evaluation methods. Then we compare our proposed method with baselines and conduct an ablation study in the offline experiment. Lastly, we present the online impact measured on a large-scale worldwide E-commerce website.

A. Training dataset
We use historical aggregated traffic feedback data, which records the counts of customer behavior for a specific mapping pair since inception, including impressions, clicks, purchases, and GMV (gross merchandise value). We only keep recommendations with over 250 impressions to balance signal quality against dataset size. Since we only use aggregated counts of customer behavior, no customer identification information is touched. We exclude the mappings whose query products occur in the validation data to avoid data leakage [30]. There are 460k mappings in the training set. It consists of data from 11 countries: US (English), UK (English), DE (German), FR (French), JP (Japanese), CA (English), IT (Italian), ES (Spanish), IN (English), AU (English), and MX (Spanish).
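The two filtering rules above (an impression threshold and the exclusion of validation query products) can be sketched as below. The record layout and the field names `query_id` and `impressions` are hypothetical; only the threshold of 250 comes from the text.

```python
from typing import Dict, Iterable, List, Set


def build_training_set(
    mappings: Iterable[Dict],
    validation_queries: Set[str],
    min_impressions: int = 250,
) -> List[Dict]:
    """Keep mappings with more than min_impressions whose query product
    does not appear in the validation data."""
    kept = []
    for m in mappings:
        if m["impressions"] <= min_impressions:
            continue  # too few impressions for a reliable signal
        if m["query_id"] in validation_queries:
            continue  # would leak validation queries into training
        kept.append(m)
    return kept


if __name__ == "__main__":
    raw = [
        {"query_id": "q1", "alt_id": "a1", "impressions": 900, "purchases": 4},
        {"query_id": "q1", "alt_id": "a2", "impressions": 120, "purchases": 1},
        {"query_id": "q2", "alt_id": "a3", "impressions": 700, "purchases": 0},
    ]
    print(build_training_set(raw, validation_queries={"q2"}))
```

Excluding by query product rather than by pair is the stricter rule: a query seen in validation never contributes any training mapping.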

B. Evaluation dataset and metrics
We prepare two datasets to evaluate the two aspects of our substitute recommendation, functionality and buyability, which are described as follows:
a) Functionality classification dataset: It contains 215k mappings from product managers' audits of traffic data, random negative samples, and good/bad mappings extracted from traffic data based on customer signals. It is a binary classification dataset in which the mappings are classified as substitutable or non-substitutable. The ratio between positive (substitutable) samples and negative (non-substitutable) samples is kept at 6:4, which is similar to the production distribution. The area under the precision-recall curve (AuPRC) is used as the functionality classification metric.
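For reference, one common step-wise estimate of the area under the precision-recall curve is average precision, sketched below. Whether this work uses this estimator or another interpolation is not stated, so the choice is an assumption, as is the function name.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision values at each positive's rank.

    Ties in scores are broken by input order, so results can depend on
    ordering when many scores are equal.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    y = [1, 0, 1, 0]  # 1 = substitutable, 0 = non-substitutable
    s = [0.9, 0.8, 0.7, 0.1]
    print(average_precision(y, s))
```

With a 6:4 positive-to-negative ratio, a scorer that ranks at random would obtain an average precision of roughly 0.6, which is the reference point for reading this metric.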
b) Buyability ranking dataset: It contains 222k mappings from the traffic dataset with more than 500 impressions. A higher impression threshold is used for a more reliable purchase rate estimate. Normalized discounted cumulative gain (NDCG) over the PR is used to evaluate the buyability ranking performance. We first calculate the NDCG for each query product based on the model score and the ground truth purchase rate, and then average over all query products to get the final metric.
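The per-query NDCG and its average can be sketched as follows. The text does not specify the gain function, the rank cutoff, or how queries with no purchases are handled; the sketch assumes a linear gain equal to the purchase rate, no cutoff, and skips queries whose ideal DCG is zero. All names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(scores: Sequence[float], gains: Sequence[float]) -> Optional[float]:
    """NDCG of the ranking induced by scores; None if all gains are zero."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ideal = dcg(sorted(gains, reverse=True))
    if ideal == 0:
        return None
    return dcg([gains[i] for i in order]) / ideal


def mean_ndcg(rows: Iterable[Tuple[str, float, float]]) -> float:
    """Average NDCG over query products.

    rows: (query_id, model_score, purchase_rate) triples.
    """
    by_query: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for query_id, score, pr in rows:
        by_query[query_id].append((score, pr))
    values = []
    for pairs in by_query.values():
        value = ndcg([s for s, _ in pairs], [g for _, g in pairs])
        if value is not None:
            values.append(value)
    if not values:
        raise ValueError("no query has a non-zero purchase rate")
    return sum(values) / len(values)


if __name__ == "__main__":
    rows = [
        ("q1", 0.9, 0.02), ("q1", 0.1, 0.01),  # ranked correctly
        ("q2", 0.1, 0.02), ("q2", 0.9, 0.01),  # ranked in reverse
    ]
    print(mean_ndcg(rows))
```

Averaging per query, rather than pooling all mappings, keeps popular query products from dominating the metric.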

C. Offline experiment
In this section, we validate the proposed design choices by checking the offline metrics. To reduce the search space, we sequentially search for the best setting of each individual component in our design and keep it in the following experiments. We conduct the offline evaluation on the US fold of the data for the first two experiments and on all marketplace data for the last multilingual experiment. For data safety, performance is reported as the delta over the baseline, which is marked by an underline and a dash. "NA" means the model cannot handle the corresponding data/language. All experiments are based on single runs on the validation sets. We expect the results to be reliable, since both validation sets have over 200k samples.
1) The choice of objective loss function: We compare two settings, classification and regression, and show their AuPRC and NDCG for different objectives in Table I. The baseline model is GBDT.
First, we observe that using PR as supervision achieves the best performance in AuPRC, NDCG@CVR, and NDCG@PR. Second, logistic regression (binary cross-entropy) performs worse than the log-transformed PR model in ranking. The reason is that PR provides extra information for the model to identify the high-performing pairs, while the classification setting treats all purchased pairs the same.
2) Monolingual vs. Multilingual: In this section, we aim to build a single multilingual model that achieves the best performance across all marketplaces. For all transformer-based models, we use a batch size of 512 and the AdamW optimizer [31]. We experimented with different learning rates (lr=1e-5, 2e-5, 4e-5, 8e-5) and found that 4e-5 worked best. We used the base version of each model provided by [32]. To save space, Table II reports only the US and the top 2 non-English marketplaces (JP and DE), together with the overall performance across all 11 marketplaces.
The GBDT model remains a strong baseline in the US but can only support English. RoBERTa achieves comparable performance in the US, with a lower AuPRC but a higher NDCG score. If provided with only the monolingual corpus, the multilingual model behaves very similarly to its monolingual counterpart (XLMR (US) vs. RoBERTa), but it also performs well in the non-English marketplaces even without any supervision. For example, XLMR (US) achieves a higher AuPRC for both JP and DE than for the US with only English training data. This demonstrates its ability to perform cross-lingual inference in the E-commerce domain. We can further improve US performance by 0.6% in AuPRC with data from other marketplaces (both English and non-English). This validates that the multilingual model can generate universal embeddings for different languages and that the model benefits from more training data. After these changes, we can support all marketplaces with a single model while raising the bar for our main marketplace.

D. Online customer impact
We experimented with three different model variations. We present the customer impact during the experiments in Table III, in which each variation is compared against its predecessor and only the marketplaces that the new model supports are compared (US for the GBDT models, and 11 marketplaces for the Robust XLMR model). Note that "No model" means that we only use the upstream score to filter and rank the mappings, which is inefficient and requires repetitive audits. For the GBDT model, we concatenate the other numerical features in Table IV with the sentence embedding as input. The detailed technical settings can be found in Table IV and Table V.
The Naive GBDT model increased revenue by 10% because of larger coverage. However, it suffered from selection bias and hence decreased the purchase rate by 12.6%. In addition, the Naive GBDT model was evaluated only on the traffic data, so its model selection was also suboptimal. With negative sampling and by replacing CTR with PR, the Robust GBDT model significantly outperformed Naive GBDT, with 19% incremental revenue and a 24.1% purchase rate increase. Robust XLMR further drove 19% additional revenue with only a 2.5% purchase rate decrease, even though it has a much smaller feature set. The improvement is mainly driven by the non-US marketplaces without dedicated model support, where it achieves 22.3% higher PR and 0.6% higher OPS.

VI. CONCLUSION AND FUTURE WORK
This paper frames the goal of substitute recommendation as optimizing both buyability and functionality. We identify the issues of inaccurate supervision, selection bias, and domain gap in the E-commerce corpus and provide techniques to address them. The proposed method is demonstrated to be effective in both offline and online experiments. In future work, we plan to disentangle buyability and functionality: we will build a human-labeled functionality dataset and learn a multi-task model that treats buyability and functionality separately.

Fig. 1: A popular substitute pair under the co-view substitute definition in which the alternative product has a different function from the reference product.

Fig. 2: Purchase rate distribution histograms with different transformations for the mappings with over 2000 impressions. The density is normalized such that the total area of the histogram equals one.

Fig. 3: Model score distribution for random mappings and positive mappings in the functionality classification set. (a) Predicted score distribution without negative sampling. (b) Predicted score distribution with negative sampling. The blue part is the random mapping and the orange part is the positive mapping. For illustration purposes, the score shown here is log-transformed.

TABLE II: Monolingual vs. multilingual model. The baseline is underlined and its score is marked by a dash.

TABLE III: Online model performance. The delta is measured against the predecessor of each model.

TABLE IV: Feature set used for each model iteration.

TABLE V: Details of the settings for each model iteration.