G-STO: Sequential Main Shopping Intention Detection via Graph-Regularized Stochastic Transformer

Sequential recommendation requires understanding the dynamic patterns of users' behaviors, contexts, and preferences from their historical interactions. While most research emphasizes item-level user-item interactions, they often overlook underlying shopping intentions, such as preferences for ballpoint pens or miniatures. Identifying these latent intentions is vital for enhancing shopping experiences on platforms like Amazon. Despite its significance, the area of main shopping intention detection remains under-investigated in the academic literature. To fill this gap, we introduce a graph-regularized stochastic Transformer approach, G-STO. It considers intentions as product sets and user preferences as intention composites, both modeled as stochastic Gaussian embeddings in latent space. We also employ a global intention relational graph as prior knowledge for regularization, ensuring related intentions are distributionally close. These regularized embeddings are then input into Transformer-based models to capture sequential intention transitions. On testing our model with three real-world datasets, it outperformed the baselines by 18.08% in Hit@1, 7.01% in Hit@10, and 6.11% in NDCG@10.


INTRODUCTION
Sequential recommendation, which aims at understanding evolving customer behaviors and dynamically recommending new items, has garnered considerable interest. E-commerce stores, e.g., Amazon, have adopted customers' shopping intention signals into their recommendation systems [11]. Although such initial works were built with a heuristic shopping intention detection approach, they already achieved significant improvement in terms of the MOI metric (metric of interest), which considers both the short- and long-term effects on the customers' shopping experience. Besides product-level recommendations, whole-page optimization, another downstream application, can also benefit from such customer intention signals. Recent work [17] has also achieved a significant lift in MOI on retail website homepages and checkout pages through online A/B experimentation. Thus, a robust, scalable, and explainable shopping intention detection approach plays a crucial role in multiple stages of the recommendation pipeline.
However, the majority of existing sequential recommendation algorithms [5,18,36,37] focus solely on product-level data to predict subsequent recommendation items, regardless of the underlying shopping intentions. This leaves these approaches inadequate for capturing whole-sequence patterns, as they tend to place more emphasis on the item transitions learned from training data than on comprehending customers' underlying objectives. For example, given a user's interaction sequence as in Figure 1, most sequential recommendation systems will emphasize the commonly seen transition pair (baby formula → baby formula) to recommend subsequent products without explanation. In contrast, if we can identify the main shopping intention, we will find that PC Accessories should be the most important intention for this customer and recommend products accordingly.
To this end, some approaches [2,23] start to take both product-level and intention-level data into account, but merely use the concealed main shopping intentions as implicit guidance for the subsequent item recommendation. However, this implicit guidance can be severely influenced by the product-level interactions, letting popular item transitions dominate the main intention detection. Thus, explicitly identifying customers' main shopping intentions to infer user preferences becomes a prerequisite for explainable recommendations and user understanding.
To accomplish the main shopping intention identification task, the most typical and direct method is to map product-level interactions to intention-level sequences and then apply sequential recommendation algorithms. Among existing methods, recent Transformer-based advances [18,24,36] introduce the self-attention mechanism to reveal position-wise item-item relations and have achieved state-of-the-art performance. Despite their success in product-level sequential recommendation, we argue that simply adapting such embedding-based Transformer architectures to intention-level sequences fails to incorporate: (1) the shopping intention characteristics: shopping intentions form a higher-level taxonomy and can be considered sets of products; using only deterministic embeddings to represent shopping intentions is insufficient to capture this high-level characteristic; (2) user preferences composed of multiple intentions: users can have multiple intentions and preferences in mind during a shopping journey; characterizing user preferences as deterministic points in the Transformer architecture is likewise insufficient for estimating the relevance between user preferences and a composition of multiple shopping intentions; (3) collaborative transitivity: collaborative transitivity denotes the ability to introduce additional collaborative relevance beyond observed intention transition pairs; Transformer architectures employ a dot-product-based attention mechanism, which makes it difficult to infer relevance across pairs (e.g., using the pairs (a, b) and (b, c) to infer that a and c are relevant as well); (4) dynamic uncertainty: in customer interaction sequences, it is common to witness significant random shifts in preferences without obvious correlations across intention transitions; customers with more interest in dynamic variability are intuitively more uncertain, so dynamic uncertainty is a vital component when modeling user preferences.
To this end, we present a new graph-regularized stochastic Transformer framework, G-STO, for main shopping intention identification. Our approach overcomes the aforementioned challenges with the following key designs: (1) To better incorporate collaborative transitivity and uncertainty into the representations, we describe each intention as an elliptical Gaussian distribution. Specifically, G-STO applies stochastic embedding layers to assign each intention a mean and a covariance embedding, composing the stochastic representation. (2) To transfer knowledge from popular intentions to unpopular ones and thus mitigate the cold-start issue, we introduce global intention relation information as prior knowledge for improved intention modeling. Specifically, we design an intention relation graph for regularization, with diverse intentions as nodes and complementary/relevance relations between intentions as edges. By propagating intention representations on the graph, relevant shopping intentions are drawn towards each other in the latent representation space to share close embedding distributions. This also alleviates the data scarcity issue for unpopular intentions, whose embeddings can be inferred from their neighbors on the graph.
(3) Once we obtain the regularized stochastic representations for the users' interactions with intentions, we feed them to mean and covariance Transformers to model the sequential information from intention transitions. Instead of using the dot product to compute relevance scores between user preferences and recommendations, as in deterministic models, we apply the Wasserstein distance to measure the distances between distributions. Treating these distances as dissimilarities between intentions with uncertainty information, we combine them with the Bayesian Personalized Ranking (BPR) loss [30] as the training objective.
Our contributions can be summarized as follows:
• To the best of our knowledge, this is the first work focusing on the main shopping intention identification task with solely intention-level data. This helps with user understanding and improves the performance of downstream tasks, including product-level recommendation and the ranking stage of whole-page optimization;
• We describe the intentions as Gaussian distributions using stochastic representations to reflect the high-level properties of intentions, collaborative transitivity across intentions, and user uncertainty;
• We introduce the shopping intention relation graph as prior knowledge and propose a novel graph regularizer to restrict the stochastic representations in distribution-based methods;
• We develop three different Amazon real-world datasets, covering long-term, short-term, and purchase-related use cases. Our proposed G-STO outperforms state-of-the-art baselines significantly, by 18.08% in Hit@1, 6.11% in NDCG@10, and 7.01% in Hit@10 on average over the three datasets.

RELATED WORKS

Sequential Recommendation (SR)
Sequential Recommendation (SR) aims to predict the next item based on the user's historical interactions with products. Earlier SR works, like FPMC [31] and Fossil [13], apply Markov Chains with matrix factorization to model first-order and higher-order item-to-item transition matrices. However, when encountering long interaction sequences or considering long-term influence from previous items, the computation needed to model the transition matrices grows exponentially. More recent sequential recommenders learn item-to-item transition patterns automatically via deep learning architectures, including Convolutional Neural Networks (CNN) [37], Recurrent Neural Networks (RNN) [7,21,25,27,29], and self-attentive models [18,36,41]. Among them, Transformer-based models reach state-of-the-art performance with the ability to extract context information from all past actions and to learn short-term dynamics. However, they still struggle with the cold-start issue [44]: for unpopular items, the representations are under-trained, and it is difficult for models to make precise predictions about them without sufficient data. In addition, although these models have been successful in sequential recommendation and can be directly transferred to the main shopping intention identification task, they are incapable of capturing the unique characteristics of intentions, as intentions are higher-level concepts than items.

Intention Identification in SR
With the development of recommendation systems, researchers have started to seek other side information to help improve recommendation quality. Shopping intentions usually serve as coarse-grained side information that can better describe users' preferences. Most existing methods [23,24,35,38,45,48] treat the intentions as implicit guidance or an additional feature for downstream product-level recommendation. However, the intention guidance learned from the sequences may not be in line with expectations: because they use the same model architecture on both intention and item sequences, the main shopping intention can be misled by commonly seen yet totally irrelevant item pairs. To resolve this issue, other methods leverage clustering algorithms [2] or graph neural networks [4,6,40] to better understand the intentions by linking them with items and users. However, these approaches lose the explainability of shopping intentions for better user understanding. Another line of work identifies shopping intentions from search queries input by users [10,12,22,43], which is orthogonal to our work.

Stochastic Representations in SR
Representing concepts (e.g., natural language sequences, images, graphs) as distributions has attracted interest from the research community [1,14,20,28,33,34]. Most stochastic models represent concepts with Gaussian distributions, composed of a mean and a covariance. Distribution representations introduce uncertainty and provide more flexibility than deterministic embeddings. In the recommendation systems area, a few studies propose to leverage the advantages of Gaussian distributions to flexibly represent users and items. For example, GeRec [16] models each user and item as a Gaussian distribution and applies a CNN to a matrix sampled from these distributions for inference. To dynamically monitor the sequential changes in user interactions, a series of works [8,9,47] also combine Gaussian distributions with sequential recommendation algorithms. DT4SR [8] is one of these attempts, proposing mean and covariance embeddings to model items as distributions. STOSA [9] extends the DT4SR architecture with a new stochastic attention mechanism based on the Wasserstein distance. However, all previous stochastic methods learn the distributions only from the transitions within the sequence and can still fail in the cold-start situations discussed in § 2.1, which G-STO mitigates. In addition, none of the aforementioned methods naturally combines stochastic representations with intentions to learn intention distributions, which G-STO also accomplishes. Another methodological line is the variational autoencoder (VAE), which approximates posterior distributions of latent variables via variational inference. Combining SR and VAE, SVAE [32] and VSAN [46] learn dynamic hidden representations of shopping sequences. However, these efforts still perform worse than existing deterministic models on many tasks [18,36]. As an update, ACVAE [42] incorporates adversarial variational Bayes and the maximization of mutual information between user embeddings and input sequences to obtain more distinctive and personalized representations of individual users. However, all these VAE-based SR methods are prone to posterior collapse, generating poor-quality latent representations, which G-STO avoids.

METHOD
In this section, we introduce our proposed graph-regularized stochastic model, G-STO, for main shopping intention identification (§ 3.1). Figure 2 displays the workflow of G-STO. It consists of several key components: (1) stochastic representations (§ 3.2), which model the shopping intentions as stochastic distributions consisting of mean and covariance embeddings; (2) an intention relation regularizer (§ 3.3), which creates an intention relation graph and regularizes more relevant intentions on the graph to have closer stochastic representations; (3) mean and covariance Transformers (§ 3.4), which encode sequential information from the transition patterns in user historical interactions to generate stochastic representations of user preferences; (4) the Wasserstein distance (§ 3.5), which measures the dissimilarity between intentions and user preferences and is combined with the Bayesian Personalized Ranking (BPR) loss [30] on positive and negative sequences for model training.

Problem Definition
A sequential recommendation system collects the interactions between a set of users $\mathcal{U}$ and a set of items $\mathcal{V}$ (e.g., clicks, purchases, etc.) and sorts them chronologically into sequences. Similarly, to identify users' main shopping intentions, models need to map all items in $\mathcal{V}$ to their corresponding shopping intentions $\mathcal{M}$ and reorganize them chronologically as $\mathcal{S}^u = [m^u_1, m^u_2, \dots, m^u_{|\mathcal{S}^u|}]$. The goal of main shopping intention identification can then be formulated as estimating

$$p\big(m^u_{|\mathcal{S}^u|+1} = m \mid \mathcal{S}^u\big),$$

which measures the probability of an intention $m$ being the main shopping intention of user $u$ given $u$'s sequence.

Stochastic Embedding Layers
Different from deterministic embedding layers that only map items/intentions to unique high-dimensional vectors, stochastic embedding layers formulate the intentions as high-dimensional elliptical Gaussian distributions. These Gaussian distributions are constructed from mean and covariance embeddings, spanning a broader space that includes more high-dimensional points, which naturally captures the high-level semantics of shopping intentions. Specifically, we define a mean embedding table $\mathbf{T}^\mu \in \mathbb{R}^{|\mathcal{M}| \times d}$ and a covariance embedding table $\mathbf{T}^\Sigma \in \mathbb{R}^{|\mathcal{M}| \times d}$, together with the corresponding positional embeddings. For example, the first intention $m_1$ in the sequence $\mathcal{S}^u$ is formulated as a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ with

$$\boldsymbol{\mu}_1 = \mathbf{T}^\mu_{m_1} + \mathbf{P}^\mu_1, \qquad \Sigma_1 = \mathrm{diag}\big(\mathrm{ELU}(\mathbf{T}^\Sigma_{m_1} + \mathbf{P}^\Sigma_1) + \mathbf{1}\big),$$

where $\mathrm{ELU}$ is the ELU activation function, $\mathbf{P}^\mu$ and $\mathbf{P}^\Sigma$ are the positional embeddings of the mean and covariance streams, and $\mathbf{1} \in \mathbb{R}^d$ is an all-ones vector that keeps the diagonal covariance entries strictly positive.
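As a concrete illustration, below is a minimal PyTorch sketch of such a stochastic embedding layer, assuming diagonal covariances and the ELU(·) + 1 positivity trick described above; the class name, argument names, and dimensions are illustrative and do not correspond to released code.

```python
import torch
import torch.nn as nn

class StochasticIntentionEmbedding(nn.Module):
    """Maps each intention id to a Gaussian: a mean vector and a diagonal covariance."""

    def __init__(self, num_intentions: int, max_len: int, dim: int):
        super().__init__()
        self.mean_emb = nn.Embedding(num_intentions, dim)   # T^mu
        self.cov_emb = nn.Embedding(num_intentions, dim)    # T^Sigma
        self.mean_pos = nn.Embedding(max_len, dim)           # positional embedding (mean)
        self.cov_pos = nn.Embedding(max_len, dim)            # positional embedding (covariance)
        self.elu = nn.ELU()

    def forward(self, seq: torch.Tensor):
        # seq: (batch, seq_len) of intention ids
        pos = torch.arange(seq.size(1), device=seq.device).unsqueeze(0)
        mu = self.mean_emb(seq) + self.mean_pos(pos)
        # ELU(.) + 1 keeps every diagonal covariance entry strictly positive.
        sigma = self.elu(self.cov_emb(seq) + self.cov_pos(pos)) + 1.0
        return mu, sigma

# Toy usage: 2 sequences of length 5 over a catalogue of 1000 intentions.
layer = StochasticIntentionEmbedding(num_intentions=1000, max_len=50, dim=64)
mu, sigma = layer(torch.randint(0, 1000, (2, 5)))
print(mu.shape, sigma.shape)  # torch.Size([2, 5, 64]) torch.Size([2, 5, 64])
```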

Intention Relation Graph Regularizer
To effectively model infrequent and under-trained shopping intentions, our goal is to utilize the most related intentions to help the model comprehend them. To this end, we propose a novel graph-based regularizer that encourages more pertinent shopping intentions to share closer stochastic representations. We introduce the global intention relationship as prior knowledge and create a graph accordingly, as follows.

Intention Relation Graph. We create the intention relation graph with the aid of P-Companion [11]. Given a pair of shopping intentions $(m_i, m_j)$ as input, we treat the co-purchase relation between them as a distantly supervised label $y_{i,j} \in \{+1, -1\}$. To extract relevant information from co-purchase relations, each shopping intention and its complementary side are represented by two separate embeddings, $\mathbf{e}_i$ and $\mathbf{e}^c_i$. Thus, to infer the relation between shopping intentions $(m_i, m_j)$, we first apply a 2-layer feed-forward network (FFN) to transform $\mathbf{e}_i$:

$$\hat{\mathbf{e}}_i = \mathbf{W}_2\,\mathrm{ReLU}(\mathbf{W}_1 \mathbf{e}_i + \mathbf{b}_1) + \mathbf{b}_2,$$

where $\{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\}$ are trainable parameters. Then, we optimize the relation between the transformed embedding of $m_i$, $\hat{\mathbf{e}}_i$, and the complementary embedding of $m_j$, $\mathbf{e}^c_j$, by applying a hinge loss on the labels $y_{i,j}$:

$$\mathcal{L}_{\mathrm{hinge}} = \sum_{(i,j)} \max\Big(0,\; y_{i,j}\big(\|\hat{\mathbf{e}}_i - \mathbf{e}^c_j\|_2 - \lambda\big) + \epsilon\Big),$$

where $\lambda$ is the base distance used to distinguish $\hat{\mathbf{e}}_i$ and $\mathbf{e}^c_j$, and $\epsilon$ is the margin distance. When $y_{i,j} = 1$, the two shopping intentions have a co-purchase/complementary relation, and the model forces the distance between $\hat{\mathbf{e}}_i$ and $\mathbf{e}^c_j$ to be smaller than $\lambda - \epsilon$. Otherwise, when $y_{i,j} = -1$, the two shopping intentions do not possess the complementary relation, pushing $\hat{\mathbf{e}}_i$ and $\mathbf{e}^c_j$ away from each other to a distance greater than $\lambda + \epsilon$. With the trained $\mathbf{e}_i$ and $\mathbf{e}^c_i$, we can create the shopping intention relation graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where the nodes $\mathcal{V}$ are the different shopping intentions and the edges $\mathcal{E}$ carry the relevance/complementarity scores between intentions. To compute the edge weight between intentions $m_i$ and $m_j$, we apply the cosine similarity between $m_i$'s complementary embedding $\mathbf{e}^c_i$ and $m_j$'s embedding $\mathbf{e}_j$:

$$\mathbf{A}_{i,j} = \cos\big(\mathbf{e}^c_i, \mathbf{e}_j\big).$$

Graph Neural Regularizer. Treating the previously constructed intention relation graph as prior knowledge, we aim to regularize the more relevant intentions on the graph so that they share closer distributions. When determining the main shopping intentions of a user, we can then rank all relevant intentions higher, including unpopular ones, as long as they are represented similarly. We therefore employ a graph convolutional network (GCN) to induce stochastic representations of nodes from their neighbors' features, transferring knowledge to under-trained nodes from their frequently seen neighbors. To learn a unified set of parameters that regularizes the mean embeddings $\mathbf{T}^\mu \in \mathbb{R}^{|\mathcal{M}| \times d}$ and the covariance embeddings $\mathbf{T}^\Sigma \in \mathbb{R}^{|\mathcal{M}| \times d}$ simultaneously, we concatenate them as the initial node representations:

$$\mathbf{X}^{(0)} = \big[\mathbf{T}^\mu \,\|\, \mathbf{T}^\Sigma\big] \in \mathbb{R}^{|\mathcal{M}| \times 2d}.$$

The GCN propagation can then be represented as

$$\mathbf{X}^{(l)} = \sigma\big(\hat{\mathbf{A}}\, \mathbf{X}^{(l-1)} \mathbf{W}^{(l-1)}\big), \qquad \hat{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A}\, \mathbf{D}^{-\frac{1}{2}},$$

where $\mathbf{W}^{(l-1)} \in \mathbb{R}^{2d \times 2d}$ is the trainable weight matrix of the $l$-th layer and $\mathbf{D}$ is the degree matrix of $\mathbf{A}$. Once we obtain the regularized intention representations $\mathbf{X}^{(l)} \in \mathbb{R}^{|\mathcal{M}| \times 2d}$ at the $l$-th layer, we separate them into the new mean embeddings $\tilde{\mathbf{T}}^\mu \in \mathbb{R}^{|\mathcal{M}| \times d}$ and the new covariance embeddings $\tilde{\mathbf{T}}^\Sigma \in \mathbb{R}^{|\mathcal{M}| \times d}$, forming regularized Gaussian distributions that encode the intention relations. Finally, the stochastic embedding layer of § 3.2 is rewritten with $\tilde{\mathbf{T}}^\mu$ and $\tilde{\mathbf{T}}^\Sigma$ in place of $\mathbf{T}^\mu$ and $\mathbf{T}^\Sigma$, so that the user shopping sequences are encoded with the regularized distributions.
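To make the two steps concrete, the following Python sketch builds a normalized intention relation graph from cosine similarities between complementary and intention embeddings and runs one GCN layer over the concatenated mean/covariance tables. The pruning threshold, the added self-loops, the ReLU nonlinearity, and the names build_intention_graph / GraphRegularizer are our own illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_intention_graph(emb: torch.Tensor, comp_emb: torch.Tensor, threshold: float = 0.5):
    """Edge weight A[i, j] = cosine(comp_emb[i], emb[j]); weak edges are pruned (assumed threshold)."""
    sims = F.normalize(comp_emb, dim=-1) @ F.normalize(emb, dim=-1).T
    adj = torch.where(sims > threshold, sims, torch.zeros_like(sims))
    # Self-loops so isolated intentions keep their own representation (illustrative choice).
    adj = adj + torch.eye(adj.size(0), device=adj.device)
    # Symmetric normalization D^{-1/2} A D^{-1/2} used by the GCN propagation rule.
    deg_inv_sqrt = adj.sum(-1).clamp(min=1e-12).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

class GraphRegularizer(nn.Module):
    """One GCN layer over the concatenated [mean || covariance] intention tables."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(2 * dim, 2 * dim, bias=False)   # W^(0), shape (2d, 2d)

    def forward(self, t_mu: torch.Tensor, t_sigma: torch.Tensor, norm_adj: torch.Tensor):
        x = torch.cat([t_mu, t_sigma], dim=-1)        # X^(0): (|M|, 2d)
        x = torch.relu(self.weight(norm_adj @ x))      # propagate over neighboring intentions
        return x.chunk(2, dim=-1)                      # regularized (T~^mu, T~^Sigma)
```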

Mean and Covariance Transformers
Apart from the prior knowledge obtained from the global intention relation information, we still need to encode the sequential information in the users' historical interaction sequences. We therefore propose mean and covariance Transformers to automatically learn the hidden patterns of the intention transitions in the sequences. Deterministic Transformer-based models build their self-attention mechanisms from dot products between query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$. In sequential recommendation, the query, key, and value are obtained from linear transformations of the same sequence embedding $\hat{\mathbf{E}}_{\mathcal{S}^u}$. In distribution-based stochastic models, however, mean and covariance embeddings together form a Gaussian distribution as the intention and sequence representation. Thus, we need two separate sets of $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ for the mean and covariance embeddings of the sequence:

$$\mathbf{Q}^\mu = \hat{\mathbf{E}}^\mu_{\mathcal{S}^u} \mathbf{W}^\mu_Q, \quad \mathbf{K}^\mu = \hat{\mathbf{E}}^\mu_{\mathcal{S}^u} \mathbf{W}^\mu_K, \quad \mathbf{V}^\mu = \hat{\mathbf{E}}^\mu_{\mathcal{S}^u} \mathbf{W}^\mu_V, \qquad \mathbf{Q}^\Sigma = \hat{\mathbf{E}}^\Sigma_{\mathcal{S}^u} \mathbf{W}^\Sigma_Q, \quad \mathbf{K}^\Sigma = \hat{\mathbf{E}}^\Sigma_{\mathcal{S}^u} \mathbf{W}^\Sigma_K, \quad \mathbf{V}^\Sigma = \hat{\mathbf{E}}^\Sigma_{\mathcal{S}^u} \mathbf{W}^\Sigma_V,$$

where the $\mathbf{W}_{**} \in \mathbb{R}^{d \times d}$ are learnable weight matrices of the linear transformations. Combining the computed queries, keys, and values with scaled dot-product attention, we use mean self-attention (MSA) and covariance self-attention (CSA) to obtain the newly generated stochastic sequence representations $\{\mathbf{z}^\mu_{\mathcal{S}^u}, \mathbf{z}^\Sigma_{\mathcal{S}^u}\}$. In addition to uncovering sequential patterns from linear transformations, we leverage feed-forward networks (FFN) to endow the model with non-linearity. The FFNs applied to the MSA and CSA outputs at each position are two-layer position-wise networks whose weights $\mathbf{W}_{**} \in \mathbb{R}^{d \times d}$ and biases $\mathbf{b}_{**} \in \mathbb{R}^{d}$ are trainable parameters.
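A minimal single-head sketch of the two attention streams is shown below, assuming a causal mask and the same ELU(·) + 1 positivity trick on the covariance stream; residual connections, layer normalization, and the FFNs of the full model are omitted, and the class name is hypothetical.

```python
import math
import torch
import torch.nn as nn

class MeanCovSelfAttention(nn.Module):
    """Single-head sketch: separate scaled dot-product attention for mean and covariance streams."""

    def __init__(self, dim: int):
        super().__init__()
        # Two independent Q/K/V projection sets: index 0 -> Q, 1 -> K, 2 -> V.
        self.mu_qkv = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.cov_qkv = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.elu = nn.ELU()

    @staticmethod
    def _attend(q, k, v, causal_mask):
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        scores = scores.masked_fill(causal_mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, e_mu: torch.Tensor, e_sigma: torch.Tensor):
        # e_mu, e_sigma: (batch, seq_len, dim) regularized stochastic sequence embeddings.
        n = e_mu.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=e_mu.device), diagonal=1)
        z_mu = self._attend(*(proj(e_mu) for proj in self.mu_qkv), mask)          # MSA
        z_sigma = self._attend(*(proj(e_sigma) for proj in self.cov_qkv), mask)   # CSA
        # Assumption: keep the covariance stream positive with the same ELU(.)+1 trick.
        return z_mu, self.elu(z_sigma) + 1.0
```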

Training and Evaluation
Wasserstein Distance. For embedding-based models, measuring how accurately the model detects the main intention requires dot products between the user preference embedding and the intention embeddings. Similarly, for stochastic models, we need to measure the distances between the distributions of the ground-truth labels and the inferred distributions. Many existing works that formulate concepts as distributions use the Kullback-Leibler (KL) divergence to compute distribution distances. However, when two intentions are extremely unrelated, their stochastic representations become two almost non-overlapping distributions, and the KL divergence describes the distance as nearly infinite, resulting in numerical instability. Thus, we use the Wasserstein distance to measure the distance between Gaussian distributions. Given two Gaussian distributions $d_i = \mathcal{N}(\boldsymbol{\mu}_i, \Sigma_i)$ and $d_j = \mathcal{N}(\boldsymbol{\mu}_j, \Sigma_j)$, the squared 2-Wasserstein distance can be computed as

$$W_2(d_i, d_j)^2 = \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_2^2 + \mathrm{trace}\Big(\Sigma_i + \Sigma_j - 2\big(\Sigma_j^{1/2} \Sigma_i \Sigma_j^{1/2}\big)^{1/2}\Big).$$

For time efficiency, the second term in the above equation can be simplified to a Euclidean norm:

$$\mathrm{trace}\Big(\Sigma_i + \Sigma_j - 2\big(\Sigma_j^{1/2} \Sigma_i \Sigma_j^{1/2}\big)^{1/2}\Big) = \big\|\Sigma_i^{1/2} - \Sigma_j^{1/2}\big\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm, which can be calculated by matrix multiplications.

Training Objective. Many deterministic sequential recommendation models apply the Bayesian Personalized Ranking (BPR) loss [30] to the dot-product scores, pulling the ground-truth label nearest to the customer preference. In G-STO, we apply the BPR loss to the previously defined Wasserstein distances between distributions to measure the correctness of the main intention identification:

$$\mathcal{L}_{\mathrm{BPR}} = -\sum_{u \in \mathcal{U}} \sum_{t} \log \sigma\Big(W_2\big(\mathbf{p}^u_t, m^-_t\big) - W_2\big(\mathbf{p}^u_t, m^+_t\big)\Big),$$

where $\mathbf{p}^u_t$ denotes the inferred distribution of user $u$'s preference at position $t$, $m^+_t$ is the ground-truth shopping intention, and $m^-_t$ denotes a negative sample drawn from the intentions the user never interacted with. During the inference stage, for each user $u$, we calculate the distances between the customer preference distribution and a candidate set containing the ground-truth intention and 100 negatively sampled intentions.
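The following sketch shows how the Wasserstein distance and the BPR objective fit together under the assumption of diagonal covariances; the function names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def wasserstein2(mu_p, sigma_p, mu_q, sigma_q):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    With diagonal covariances the trace term reduces to the squared
    (Frobenius/Euclidean) norm of the difference of standard deviations.
    """
    mean_term = (mu_p - mu_q).pow(2).sum(-1)
    cov_term = (sigma_p.sqrt() - sigma_q.sqrt()).pow(2).sum(-1)
    return mean_term + cov_term

def bpr_wasserstein_loss(pref_mu, pref_sigma, pos_mu, pos_sigma, neg_mu, neg_sigma):
    """BPR over Wasserstein distances: the ground-truth intention should lie
    closer to the user-preference distribution than a sampled negative."""
    d_pos = wasserstein2(pref_mu, pref_sigma, pos_mu, pos_sigma)
    d_neg = wasserstein2(pref_mu, pref_sigma, neg_mu, neg_sigma)
    return -F.logsigmoid(d_neg - d_pos).mean()
```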

EXPERIMENTS
In this section, we evaluate the empirical effectiveness of G-STO by studying the following research questions (RQs):
• RQ1: Does G-STO provide better shopping intention identification results than the baselines?
• RQ2: Why do we need to design different kinds of scenarios, and how does G-STO perform under different circumstances?
• RQ3: What is the influence of the intention relation graph regularizer and the stochastic representations?
• RQ4: How sensitive is G-STO to the hyper-parameters?
• RQ5: Why can G-STO alleviate the intention cold-start issue?

Data Curation
We create three benchmarks for main shopping intention identification using anonymized data from amazon.com. To provide a comprehensive evaluation, the ground-truth main shopping intention labels of the three datasets are created based on different real-life scenarios: (1) original sequences, using the raw sequences composed of user historical interactions to model the long-term shopping scenario; (2) 24-hours sequences, leveraging more frequent user-intention interactions sampled from the raw data to model the short-term shopping scenario; (3) purchase sequences, treating purchase actions in the raw sequences as strong signals of the main shopping intentions to model the purchase-related shopping scenario. Table 1 shows examples and statistics of the three datasets.

Original sequences: We follow the "leave-one-out" strategy [18] to split the sequences into training, validation, and test sets. To create the intention labels for each user and each split, we partition the curated historical sequence $\mathcal{S}^u$ of each user $u$ into three parts: (1) the most recent intention action $\mathcal{S}^u_{|\mathcal{S}^u|}$ as the intention label for the test set; (2) the second most recent action $\mathcal{S}^u_{|\mathcal{S}^u|-1}$ as the intention label for the validation set; and (3) all remaining actions as training data. Note that during testing, the input sequences contain the training actions and the validation action.

24-hours sequences: Although the original sequences describe customers' long-term preferences more accurately and are more useful for the downstream next-item recommendation task, the users' final interactions do not always convey the main shopping intention labels, due to random shifts of interest. Assuming that, for denser and more frequent interactions within a short period, the last intentions from the "leave-one-out" mechanism better reflect the main shopping intentions, we assess the time intervals between successive activities to break down the raw sequences. If a temporal gap is longer than a pre-determined threshold (e.g., 24 hours), we insert a break point and divide the sequence into two sub-sequences.

Purchase sequences: Aside from using the last intentions as labels, purchase actions can also serve as a very strong positive signal for main shopping intention identification. For each customer's historical data, a purchase action can be viewed as the main shopping intention of its preceding sub-sequence. As purchasing activities are dispersed over the sequences, we can only apply a user-based split for train/validation/test separation in this scenario.
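As a rough sketch of the two splitting rules described above (the leave-one-out split and the 24-hour session break), assuming each event is a (timestamp, intention) pair and using illustrative function names:

```python
from datetime import timedelta

def split_by_gap(events, gap=timedelta(hours=24)):
    """Break a chronologically sorted list of (timestamp, intention) events into
    sub-sequences whenever two consecutive events are more than `gap` apart."""
    sessions, current = [], []
    for ts, intention in events:
        if current and ts - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((ts, intention))
    if current:
        sessions.append(current)
    return sessions

def leave_one_out(intentions):
    """Leave-one-out split: last action -> test label, second-to-last -> validation label,
    everything before that -> training data."""
    return intentions[:-2], intentions[-2], intentions[-1]
```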

Evaluation Protocol.
We evaluate all models with the following metrics: (1) NDCG@10, a position-aware metric that assigns larger weights to higher positions; (2) Hit@K (K = 1, 2, 5, 10), which counts the fraction of times the ground-truth intention appears among the top-K predictions. We report metrics averaged over all users, and we report test-set performance for the model with the best validation NDCG@10 score.
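Since each test case ranks the ground-truth intention against 100 sampled negatives (see § 3.5), both metrics reduce to simple functions of the ground truth's rank; a small sketch with illustrative names:

```python
import math

def hit_at_k(rank: int, k: int) -> float:
    """rank is the 1-based position of the ground-truth intention in the ranked
    candidate list (ground truth plus 100 sampled negatives)."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int = 10) -> float:
    """With a single relevant item, NDCG@k reduces to 1 / log2(rank + 1) when rank <= k."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```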

Baselines.
We compare our proposed model with baselines from four different groups: (1) static recommendation methods: Count-based Bayesian (CB) is a non-learning approach that solely considers the appearance frequency of shopping intentions. We take the intention frequencies across the entire market as the prior and the intention frequencies in each user's shopping sequence as the likelihood; the final shopping intention ranking is derived from the posterior, i.e., the product of prior and likelihood. LightGCN [15] is a state-of-the-art graph-based static recommendation method that considers high-order collaborative signals in the user-item graph. (2) deterministic sequential recommendation methods: SASRec [18] is a self-attention-based sequential recommendation model that captures long-term semantics and short-term dynamics. (3) stochastic sequential recommendation methods: DT4SR [8] is a distribution-based method that maps the intentions to elliptical Gaussian distributions and then feeds them into two separate Transformer-based models to infer the users' preferences; STOSA [9] is a state-of-the-art distribution-based recommendation model that extends DT4SR with a new stochastic self-attention mechanism to further improve the combination of Transformers and stochastic representations. (4) VAE-based sequential recommendation methods: SVAE [32] is a recurrent version of VAE, combining a recurrent neural network (RNN) and a VAE; the model outputs the probability distribution of the most likely future preferences at each time step. ACVAE [42] is a state-of-the-art VAE-based model that first introduces adversarial training for sequence generation, enabling the model to generate high-quality latent variables.
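For reference, a minimal sketch of the count-based Bayesian (CB) baseline as we read its description above; the function name and the restriction to intentions appearing in the user's sequence (all others have zero likelihood) are our assumptions.

```python
from collections import Counter

def count_based_bayesian(user_seq, global_counts: Counter):
    """Rank intentions by posterior ~ prior (market-wide frequency) x likelihood
    (frequency inside this user's sequence). Intentions absent from the user's
    sequence have zero likelihood and are therefore not ranked."""
    likelihood = Counter(user_seq)
    scores = {m: global_counts[m] * likelihood[m] for m in likelihood}
    return sorted(scores, key=scores.get, reverse=True)
```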

Implementation Details.
For the hyper-parameters, the learning rate is set to 1e-4, the maximum number of epochs is 500, and the batch size is 128. For the stochastic representations, the hidden dimensionalities of the mean and covariance embeddings are set to 64. For the intention relation graph regularizer, we apply a 1-layer graph convolutional network (GCN). We train and test our code on Ubuntu 18.04.4 LTS with an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz and an NVIDIA V100 GPU, and we implement our method using Python 3.8 and PyTorch 1.6 [26]. During training, we use the Adam [19] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ in all experiments for all models. We select the best set of hyper-parameters for each model based on NDCG@10 on the corresponding validation set.

Performance Comparison
Cross-Method Comparison (RQ1). Table 2 reports the performance of G-STO and the baselines on all three benchmarks. The results demonstrate that G-STO consistently outperforms all baselines on all metrics and all three datasets. Compared with the strongest baseline, STOSA, G-STO shows significant improvements of 18.08% in Hit@1, 6.11% in NDCG@10, 16.87% in Hit@2, 11.87% in Hit@5, and 7.01% in Hit@10 on average over the three datasets. From the results, we have the following observations: (1) The performance gaps between our method and the static methods, CB and LightGCN, show the importance of temporal-order sequential information. Different from product-level recommendation, the same intention may appear multiple times in a single sequence, allowing the count-based method, CB, to achieve results comparable to the state-of-the-art static method LightGCN, and even to SASRec. (2) Comparing our method with the backbone model, SASRec, G-STO shows significant improvements of 43.44% in Hit@1 and 15.50% in NDCG@10. This indicates that the stochastic representations expand the latent space for user-intention interactions and equip the model with collaborative transitivity. Besides, the graph regularizer captures global intention information, enabling the model to better understand the intentions. Both modules enhance the intention identification capability, particularly in small-data situations. (3) The comparison between G-STO and the other distribution-based recommendation methods, DT4SR and STOSA, shows the efficacy of leveraging the intention relation graph as prior knowledge to regularize the stochastic representations, and reveals the potential of further combining G-STO with distribution-based attention. (4) Comparing G-STO with the VAE-based methods, we find that VAE-based methods are prone to posterior collapse: if the decoder is too expressive, the KL divergence term in the loss converges to 0, generating similar latent representations for all inputs. This situation worsens for less-interacted, under-trained intentions. In contrast, our model leverages graph regularization to transfer knowledge from more-interacted intentions to less-interacted ones, mitigating this issue.
Cross-Dataset Comparison (RQ2). From Table 2, we can also compare the performances horizontally to gain insights into the models' behavior in different scenarios. We notice that the performance improvements made by G-STO over the baselines vary across scenarios: the gap is larger on the 24-hours and purchase sequences than on the original sequences. We summarize the reasons as follows: (1) On the 24-hours sequences, G-STO and all the baselines achieve higher absolute performance than on the other two categories, as these sequences consist of denser and more frequent activities, in which severe random interest shifts are less likely and the hidden sequential dynamics are easier for models to learn and capture. (2) On the purchase sequences, G-STO performs slightly better, while most baselines perform poorly. This is due to the different train/validation/test splits: we apply a user-based split on the purchase sequences and "leave-one-out" on the other two categories, so the models are more likely to encounter new users/intentions during validation or testing and thus suffer from the cold-start issue. With the graph-regularized stochastic Transformer, G-STO resolves this issue better than the baselines, enlarging the performance gap on this dataset.

A/B Test
We have an existing product recommendation feature using a static mapping created from [11]. However, this procedure only provides static product-to-product recommendations; it does not take the customer's shopping history into account to provide a personalized shopping experience. Our goal was to improve recommendation relevance by taking customers' shopping intentions into consideration while optimizing sales and revenue. G-STO has recently been deployed for online A/B testing as the treatment group, while the control group is the existing solution using the static mapping. The model takes a customer's past shopping activities as input and predicts the likely product categories for downstream recommendations. This online test has run for two weeks, and we have already observed a worldwide (WW) commercial success:
• WW annual revenue was improved by 1%;
• WW annual sales were improved by 2%.

Table 2: Performance comparison in Hit@1, NDCG@10, Hit@2, Hit@5, and Hit@10 on the three datasets. The best results are boldfaced.

Ablation Study (RQ3)
Ablation on Model Structure. Figure 3 presents the ablation study verifying the effects of the different components of G-STO. We compare G-STO with model variants that each remove one key component: (1) Ours w/o Graph Regularizer (GR): removing the intention relation graph regularizer and feeding the unregularized stochastic representations directly to the mean and covariance Transformers degrades performance; (2) Ours w/o Covariance Embedding: further removing the covariance embedding results in an additional decline in performance, as the model degenerates into a lower-dimensional deterministic sequential recommendation model.

Parameter Studies (RQ4)
Effect of Hidden Dimensionality. We study the influence of the hidden dimensionality $d$ of the mean and covariance embeddings on our method. Among the search range {16, 32, 64, 128, 256, 512}, $d = 64$ works best. If $d$ is too small, the model cannot encode the shopping intentions well and cannot learn some of the higher-dimensional relations between intentions. If $d$ is too large, it becomes harder for the model to learn the high-dimensional representations, also causing a performance drop.
Effect of GNN Types. Table 4 presents the results of using various GNNs in the intention relation graph regularizer. The results indicate that the graph convolutional network (GCN) outperforms the graph attention network (GAT). This is because GAT requires additional edge-weight learning, which suits dense, noisy graphs; our intention relation graph, however, is not dense and its edge quality is high, rendering the training of additional attention weights unnecessary. We utilize the MAD metric [3] to evaluate the smoothness of the node representations. The results show that as the number of GCN layers increases from one to two, the MAD drops significantly from 0.774 to 0.379, indicating that the 2-layer GCN produces smoother representations. On the other hand, the performance drop shown in Table 4 suggests that the 2-layer GCN over-smooths.

Effect of Number of GCN Layers. We also investigate the effect of the number of GCN layers on G-STO. With a single layer of graph convolution, the GCN can only gather information from immediate neighbors; information from larger neighborhoods is incorporated only when multiple GCN layers are applied. The results in Table 4 indicate that the 1-layer GCN performs better than multi-layer GCNs. The reason is that the directly related intentions on the graph are more crucial when identifying the main shopping intentions from consumer historical data, and introducing more distant neighbors can introduce more noise.
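For clarity, a small sketch of how such a smoothness score can be computed, assuming MAD is taken as the average pairwise cosine distance between node representations with an all-other-nodes target mask (our reading of [3]); the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def mad(node_repr: torch.Tensor, mask: torch.Tensor = None) -> float:
    """Mean Average Distance over node representations: the average cosine distance
    between (masked) pairs of nodes. Lower values mean smoother, more similar embeddings."""
    normed = F.normalize(node_repr, dim=-1)
    dist = 1.0 - normed @ normed.T                        # pairwise cosine distances
    if mask is None:                                       # default target mask: all other nodes
        mask = 1.0 - torch.eye(dist.size(0), device=dist.device)
    per_node = (dist * mask).sum(-1) / mask.sum(-1).clamp(min=1e-12)
    return per_node[mask.sum(-1) > 0].mean().item()
```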

Intention Embeddings Comparison (RQ5)
To better explain the efficacy of the proposed intention relation graph regularizer, we compare the shopping intention representations trained by SASRec, STOSA, and G-STO via t-SNE visualization [39] in Figure 5. The red points in the figure represent the shopping intentions relevant to the intention "miniature", while the blue points indicate the irrelevant ones. From Figure 5(a), we observe that although SASRec is trained on user historical data to learn some correlations between intentions, the embeddings of relevant shopping intentions are still quite scattered across the latent space. In Figure 5(b), the red points start to cluster with each other, because the state-of-the-art distribution-based model, STOSA, can capture the collaborative transitivity between intentions, which SASRec ignores. In contrast, in Figure 5(c) it is obvious that the related intention embeddings cluster even more tightly. This shows that the intention relation graph truly regularizes the stochastic representations, constraining relevant intentions to embed closer together and improving G-STO's performance.

CONCLUSION
We presented G-STO, a graph-regularized stochastic Transformer-based model for main shopping intention identification. G-STO first models the shopping intentions as Gaussian distributions and then creates an intention relation graph as prior knowledge to regularize these distributions. The regularized stochastic representations are fed to the Transformer architecture for main shopping intention identification. We perform experiments under three different scenarios drawn from real-life applications, and extensive experimental results demonstrate the superiority of G-STO over state-of-the-art baselines. In the future, we will adapt the Transformer architecture to accommodate distribution-based models more effectively.

Figure 1: Illustration of sequential item recommendation and main shopping intention identification. For the majority of sequential recommendation algorithms, recommendations are provided without specific reasons; main shopping intention identification can provide such explanations.

Figure 2: Illustration of G-STO model, containing key components: (A) stochastic representations to map each intention as a Gaussian distribution; (B) graph regularizer to restrict the more relevant intentions to have closer representations; (C) mean and covariance transformers to encode the sequential information from user historical interactions; (D) Wasserstein distance for training and inference.


Figure 3: Ablation study on model components. The performances are averaged over the three datasets.

Figure 4: Parameter studies of our model about the hidden dimensionality of mean and covariance embeddings on original sequences.The red line indicates that the best performances are obtained when the hidden size is 64.

Figure 5: t-SNE visualization of intention embeddings trained by (a) SASRec, (b) STOSA, and (c) G-STO. Blue points are intentions irrelevant to a given intention; red points are relevant shopping intentions.

Table 1: An example of generated sequences under different scenarios. Different numbers represent different shopping intentions. The blue intentions indicate the validation-set labels and the red intentions indicate the test-set labels.

Table 4: Study on different graph neural networks (GNNs) in the intention relation graph regularizer on the original sequences.