Global-Local GraphFormer: Towards Better Understanding of User Intentions in Sequential Recommendation

Transformer-based models have achieved great success in the multimedia sequential recommendation task due to their strong ability to handle sequential data. However, existing Transformer-based models regard the items in the sequential data as a user-specific fully-connected graph (local graph) and only explicitly consider the temporal information in the local graph to capture users' intentions, ignoring the fact that the user-item bipartite graph (global graph) may carry important relation patterns among the sequential items. Additionally, according to the current literature, it is still unclear whether (and how) the information hidden in the global graph can help Transformer-based models better understand users' sequential behavior. To investigate this problem, we propose to utilize the global graph information to help Transformer-based sequential recommendation, where information from different modalities, i.e., user-item interactions in the global graph and temporal patterns in the historical sequences, is taken into account jointly. Concretely, we propose two Global-Local (GL) GraphFormer models that utilize both the global graph and the local temporal information. One GL-GraphFormer equips the Transformer-based model with both first- and second-order graph information through two specifically designed encodings. The other GL-GraphFormer transfers higher-order graph information into the local Transformer with pretrained Graph Neural Networks (GNNs). Extensive experiments on several real-world datasets demonstrate that i) our proposed GL-GraphFormers bring substantial improvement over baseline methods, and ii) the benefits of different orders of global graph information vary with the dataset sparsity.


INTRODUCTION
With the rapid growth of modern web services and mobile devices, the increasing amount of available sequential user behavior is playing an important role in improving the accuracy of multimedia recommendation. The sequential behaviors of a user are driven by their personal intentions, which evolve over time. A deep understanding of a user's sequential historical behaviors can help to discover the user's intentions, thus resulting in more accurate personalized sequential recommendation, which aims at leveraging the sequences of behaviors to recommend the next item that the target user may be interested in. Inspired by the power of the Transformer [24] in handling temporal information in sequential data, recent works [1,11,16,21] utilize Transformer-based models for sequential recommendation and have achieved impressive success.
Transformer-based models for sequential recommendation generally regard the historical user behaviors as a user-specific fully-connected graph (local graph), and simply append a temporal encoding to the node features to capture user intentions. However, different from traditional sequential data, the items within the historical user behaviors tend to form more complex relations with each other, since there exists a global user-item bipartite graph that connects the items through users to other items. For example, the more users interact with both item A and item B, the closer the relationship between item A and item B should be. Also, if item A is consumed by many users, then its popularity in the global graph may have an impact on its role in the user-specific local fully-connected graph. These intuitive observations indicate that the information from the global graph influences both the nodes and the edges in the local graph. However, the existing literature ignores the connection between global and local information, which is important for understanding the user's intentions.
In this paper, we propose to utilize the global graph information to help Transformer-based sequential recommendation. Our target is to explicitly incorporate the global graph information into the Transformer, which makes it necessary to investigate the following problems:
• What kind of global graph information can help the local Transformer better understand the sequential behavior?
• What is the most appropriate strategy to combine the global graph information with the local sequential temporal information so that the sequential recommendation can be improved?
To address the two problems, we propose two Global-Local GraphFormer (GL-GraphFormer) models based on the basic Transformer structure, so that our proposed strategies can fit other models built on the Transformer structure. Specifically, one GL-GraphFormer is designed to utilize the first-order and second-order information of the global graph, i.e., the item popularity and the number of common users consuming a pair of items. We encode the first-order and second-order information into the node and edge features, respectively, of the local fully-connected graph of the Transformer, expecting that this kind of information can help the model capture user intentions more accurately. Inspired by the power of Graph Neural Networks (GNNs) in recommendation [2,6,22,27,28] for capturing higher-order graph information, the other GL-GraphFormer pretrains a GNN on the global graph and then transfers the higher-order knowledge to the local graph. We conduct experiments on several real-world datasets, and the results show that i) the two proposed GL-GraphFormer models bring significant benefits to sequential recommendation and ii) datasets with different sparsity may benefit from different orders of global graph information. We summarize our contributions as follows:
• We propose to investigate the problem of utilizing the global graph information to help Transformer-based sequential recommendation, which may potentially serve as a novel and promising research trend.

RELATED WORK
General Recommendation. Recommender systems aim to predict to what extent a user will like an item. Early recommender systems generally employ collaborative filtering [3,9,13,19,20], based on the idea of matrix factorization, to learn latent representations for users and items. With the development of deep learning, more and more deep models have been used for recommendation and have achieved great success. The Multi-Layer Perceptron (MLP) has been used to capture the non-linear relationships between user and item embeddings [7]. Auto-encoder-based methods show robustness to behavior noise [14]. Due to the graph nature of the user-item relation, recent efforts [2,6,22,27] design different kinds of Graph Neural Networks (GNNs) for collaborative filtering. GNN-based models take advantage of message passing to obtain high-order connectivity from the user-item interactions and show a strong ability to improve recommendation accuracy. However, these works neglect the sequential nature of the user's behavior, which may limit their ability to provide more personalized recommendation.
Sequential Recommendation. Taking the sequential characteristics of the user's behavior into consideration, another line of recommendation, sequential recommendation, has gained increasing attention in the community. A sequential recommender aims to predict the user's preference towards an item based on the user's personalized historical behavior. Early sequential recommenders utilize first- or second-order Markov chains [23,25] to model the users' sequential behavior. Inspired by the expressive power of sequential deep models, later works have achieved further improvement by utilizing recurrent neural networks [8,10,30] and the more recent Transformer-based models [1,11,16,21]. Transformer-based models generally regard the items in the sequential historical behavior as a fully-connected graph with temporal encoding for the nodes, and then utilize the self-attention mechanism to extract user intentions from the graph. However, this kind of graph construction neglects the complex relationships among the historical items in the global user-item bipartite graph, which has shown effectiveness in general collaborative filtering. Although some other works [15,29,31] provide other ways to construct the graph for the sequential historical items, such as utilizing additional sequences [26], few works explore whether the global bipartite graph, which shows strong ability in general recommendation, can help Transformer-based sequential recommendation.
Global graph information has long been utilized for general recommendation through graph representation learning models such as GNNs. The idea of combining global graph information with the local fully-connected GraphFormer, together with the experimental findings in this work on the benefits brought by different orders of graph information, can be regarded as a bridge between traditional graph-based general recommendation and sequential recommendation.

METHOD
In this section, we present the two Global-Local GraphFormers. The overall framework of the two GL-GraphFormers is shown in fig. 1. We first introduce the preliminaries and how the Transformer is generally used to extract the user's intention in sequential recommendation in section 3.1. Then, in section 3.2 and section 3.3, we describe in detail how we design the two GL-GraphFormers. Finally, in section 3.4, we describe how we train our GL-GraphFormers.

Preliminaries
Assume that there are $n$ users and $m$ items in total, constituting the user set $\mathcal{U}$ and the item set $\mathcal{I}$. The users and the items form a bipartite graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, which we call the global graph. The nodes in $\mathcal{V}$ contain all the users and items, i.e., $\mathcal{V} = \{\mathcal{U}, \mathcal{I}\}$, $|\mathcal{V}| = n + m$, and there exists an edge in $\mathcal{E}$ if a user has interacted with an item. Note that edges only exist between a user node and an item node; there is no edge between two users or between two items. In general recommendation with GNNs, we usually split the edges in $\mathcal{E}$ into training, validation, and test sets.
The task is to utilize the edges in the training dataset to conduct the message passing so that the model can predict the edges in the validation and test dataset.
In the sequential recommendation scenario, we focus on user-centric data, i.e., the user's sequential historical behavior. For each user $u \in \mathcal{U}$, we have their historical behavior $\mathbf{x}^u = [x^u_1, x^u_2, \dots, x^u_T]$, where $\mathbf{x}^u$ contains the $T$ items that user $u$ most recently interacted with, in chronological order, and each $x^u_t \in \mathcal{I}$. The task of sequential recommendation is to predict the next item that user $u$ will most likely interact with. Next, we describe how a general Transformer-based sequential recommender extracts the user's future intention.
We first map each item $x^u_t$ in the historical sequence to its embedding and obtain the embedding sequence

$$\mathbf{e}^u = [e^u_1, e^u_2, \dots, e^u_T] \in \mathbb{R}^{T \times d},$$

where $d$ is the dimension of the item embedding. Then, to capture the temporal relations of the items, each item is equipped with its time position embedding, and we obtain the time-aware item embedding $z^u_t$ as follows,

$$z^u_t = f(e^u_t, p_t), \tag{1}$$

where $p_t$ is the embedding for position $t$ and $f(\cdot)$ is a function that fuses the item embedding and the position embedding. Widely adopted choices of $f$ include adding the two embeddings (assuming they have the same dimension) or concatenating them followed by a linear layer. For simplicity, we adopt the addition operation for $f(\cdot)$ in our experiments, so the dimension of $z^u_t$ is $d$. The Transformer encoder takes the time-aware historical embedding sequence $\mathbf{z}^u = [z^u_1, \dots, z^u_T]$ as input and outputs the representation of the user's future intention.

A Transformer-based model can stack many Transformer encoder blocks. The input of each Transformer encoder block is a sequence of embeddings of the same size $T \times d$. Denoting the input of the $l$-th block as $h^{l-1} \in \mathbb{R}^{T \times d}$, we describe the flow in one block of the Transformer encoder. Each block is mainly composed of a multi-head attention layer and a feed-forward layer. The multi-head attention layer is computed as follows,

$$Q_k = h^{l-1} W^Q_k, \quad K_k = h^{l-1} W^K_k, \quad V_k = h^{l-1} W^V_k, \tag{2}$$

$$A_k = \frac{Q_k K_k^\top}{\sqrt{d_k}}, \quad \mathrm{Attn}_k(h^{l-1}) = \mathrm{softmax}(A_k)\, V_k, \tag{3}$$

where $W^Q_k, W^K_k, W^V_k \in \mathbb{R}^{d \times d_k}$, $d = d_k \times H$, and $H$ is the number of heads. In Eq. (2), each item embedding in $h^{l-1}$ obtains its query $Q_k$, key $K_k$, and value $V_k$ under the $k$-th head. Then, in Eq. (3), $A_k \in \mathbb{R}^{T \times T}$ gives the relation matrix of the items under the $k$-th head, and $\mathrm{Attn}_k(h^{l-1}) \in \mathbb{R}^{T \times d_k}$ gives the user's intention at each time stamp under the $k$-th head. Finally, we concatenate the outputs of all the heads and obtain $\mathrm{MHA}(h^{l-1}) = [\mathrm{Attn}_k(h^{l-1})]_{k=1}^{H}$, with $\mathrm{MHA}(h^{l-1}) \in \mathbb{R}^{T \times d}$. To accelerate convergence, Layer Normalization (LN) is usually applied before the multi-head attention and a skip connection is adopted. We obtain the hidden state $\tilde{h}^{l}$ after the multi-head attention as follows,

$$\tilde{h}^{l} = \mathrm{MHA}(\mathrm{LN}(h^{l-1})) + h^{l-1}. \tag{4}$$

After the multi-head attention, a Multi-Layer Perceptron (MLP) is used to model the non-linear combination of different intentions. As with the multi-head attention, Layer Normalization and a skip connection are adopted, and we obtain the output of one Transformer encoder block as follows,

$$h^{l} = \mathrm{MLP}(\mathrm{LN}(\tilde{h}^{l})) + \tilde{h}^{l}. \tag{5}$$

$h^0$ is usually initialized with the historical embedding sequence $\mathbf{z}^u$. The Transformer encoder block can be stacked multiple times, depending on how many orders of relations among the items we want to capture. Assuming that we stack the block $M$ times in total, we obtain the final $h^M \in \mathbb{R}^{T \times d}$, which contains the intentions of user $u$. Usually, the intention of user $u$ given the previous $T$ historical behaviors is obtained by pooling $h^M$. Different pooling methods have been adopted in previous works to extract the user intention from $h^M$. SASRec [11] adopts last pooling, i.e., it regards $h^M_T$, the last embedding in the embedding sequence $h^M$, as the user's future intention, while CDR [1] adopts the average of the $T$ embeddings in $h^M$ as the user's future intention.
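As a rough illustration of the attention step and the two pooling strategies described above, the following is a minimal pure-Python sketch of a single attention head. The function names (`attention_head`, `last_pooling`, `mean_pooling`) are hypothetical, the projection matrices are passed in as plain nested lists, and Layer Normalization, the MLP, and any attention masking are deliberately omitted; it is not the authors' implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(h, w_q, w_k, w_v):
    """One attention head: returns the T x T relation matrix and the T x d_k output."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    relation = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return relation, matmul(softmax_rows(relation), v)

def last_pooling(h):
    """Use the last position as the user's intention (SASRec-style pooling)."""
    return h[-1]

def mean_pooling(h):
    """Use the average over all positions as the user's intention."""
    return [sum(col) / len(h) for col in zip(*h)]
```

For example, with a sequence of three 2-dimensional embeddings and identity projections, `attention_head` returns a 3 x 3 relation matrix and a 3 x 2 output, from which either pooling function extracts a single intention vector.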
Next, we describe the two proposed GL-GraphFormers, which combine the global graph information with temporal sequential recommendation.

GL-GraphFormerI: First and Second-order Information Encoding
We focus on better modeling the relations among the items in the sequence of the user's historical behavior. For each item in the global graph G, its graph information can be obtained from its neighbors or higher-order neighbors. In this part, we only consider the first-order and second-order information, and propose two kinds of encoding for the Transformer-based model.

Item popularity encoding.
The first-order neighbors of an item are the users who have interacted with this item. If many users have interacted with an item, this item may play a different role in the historical behavior. This observation matches a common concept in recommendation: item popularity. Users' purchase behavior is not only determined by their preferences but also affected by the popularity of the items. People tend to give priority to more popular items, but such popular items may hardly reflect each user's personalized interest. Explicitly incorporating the item popularity information can help the Transformer better capture both the general interests of all users and the personalized interest of each user.
To this end, we define item popularity as the degree of an item in the global user-item bipartite graph G, i.e., the number of first-order neighbors of the item. We develop the item popularity encoding according to the degree of the item, and add it to the time-aware item embedding as follows,

$$h^0_t = e^u_t + c_{\deg(x^u_t)} + p_t, \tag{7}$$

where $h^0$ is the input for the first Transformer block and $h^0_t$ is the $t$-th embedding in $h^0$, $e^u_t$ is the embedding of $x^u_t$, $c_{\deg(x^u_t)} \in \mathbb{R}^d$ is a learnable embedding vector specified by the degree $\deg(x^u_t)$, and $p_t$ is the position embedding. By using the item popularity encoding, the GL-GraphFormer can utilize the first-order information in the global graph and capture the item importance in the attention layer.
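The item popularity encoding can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the names `item_degrees` and `popularity_encoded_input` are hypothetical, the degree-embedding table is a plain list indexed by degree, and clipping large degrees to the last table entry is an assumption made here to keep the table finite; the text does not say how large degrees are handled.

```python
from collections import Counter

def item_degrees(interactions):
    """Degree of each item in the user-item bipartite graph.

    `interactions` is an iterable of (user, item) pairs; duplicate pairs
    are counted once so the degree equals the number of distinct users.
    """
    return Counter(item for _user, item in set(interactions))

def popularity_encoded_input(seq, item_emb, pos_emb, deg_emb, degrees):
    """Build the first-block input: item embedding + degree embedding + position embedding.

    Degrees beyond the table size are clipped to the last row (an assumption).
    """
    max_deg = len(deg_emb) - 1
    h0 = []
    for t, item in enumerate(seq):
        c = deg_emb[min(degrees[item], max_deg)]
        h0.append([e + ci + p for e, ci, p in zip(item_emb[item], c, pos_emb[t])])
    return h0
```

In a trained model the three tables would be learnable parameters; here they are fixed lists so that the addition itself is easy to inspect.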

Common user encoding.
The second-order neighbors of an item in the global bipartite graph G are the items that share common users with this item. The relation between two items can be bridged by the users who have interacted with both of them. The more users interact with both item $i_1$ and item $i_2$, the closer the relationship between item $i_1$ and item $i_2$ should be. However, such information is not available in the sequential Transformer recommender. We propose to explicitly encode this information into the local Transformer through the common user encoding. For items $x^u_a$ and $x^u_b$ in the historical sequence $\mathbf{x}^u$, we denote the number of users that bought both of them as $\phi(a, b)$, with $a, b \in \{1, 2, \dots, T\}$. This second-order information describes the number of common users between two items, so it is natural to encode it into the relation matrix $A_k$ in Eq. (3):

$$A_k^{(a,b)} = \frac{(Q_k K_k^\top)_{a,b}}{\sqrt{d_k}} + \beta_{\phi(a,b)}, \tag{8}$$

$$\mathrm{Attn}_k(h^{l-1}) = \mathrm{softmax}(A_k)\, V_k, \tag{9}$$

where $A_k^{(a,b)}$ is the $(a, b)$-element of the relation matrix shown in Eq. (3), and $\beta_{\phi(a,b)}$ is a learnable scalar indexed by $\phi(a, b)$ and shared across all layers. This encoding is added to all the $H$ attention heads.
By introducing the common user encoding, we explicitly inject the second-order global graph information into the local Transformer, expecting that the GL Transformer can pay more attention to the items that have more common users.
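The common user encoding can be illustrated with a short pure-Python sketch: count the users shared by every pair of items in a sequence, then add a count-indexed scalar to the corresponding entry of the relation matrix. The names `common_user_counts` and `add_common_user_bias` are hypothetical, and clipping large counts to the last entry of the bias table is an assumption made here, not something stated in the text.

```python
def common_user_counts(seq, item_users):
    """Pairwise common-user counts for the items of one sequence.

    `item_users` maps each item to the set of users who interacted with it.
    The diagonal entry for an item is simply its own number of users.
    """
    return [[len(item_users[x] & item_users[y]) for y in seq] for x in seq]

def add_common_user_bias(relation, counts, bias):
    """Add a scalar bias, indexed by the common-user count, to each relation entry.

    `bias` is a list of scalars (learnable in a real model); counts larger
    than the table are clipped to the last entry (an assumption).
    """
    max_count = len(bias) - 1
    size = len(relation)
    return [
        [relation[a][b] + bias[min(counts[a][b], max_count)] for b in range(size)]
        for a in range(size)
    ]
```

Because the bias is added before the softmax, item pairs with many common users receive a larger (or smaller, depending on the learned scalar) share of attention in every head.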
In sum, the item popularity encoding focuses on the first-order information of each item and the common user encoding focuses on the second-order information. Regarding the items in the historical sequence in the Transformer encoder as a fully-connected graph, the item popularity encoding acts on the node features and the common user encoding acts on the edge features. Replacing $h^0_t$ in section 3.1 with Eq. (7) and replacing Eq. (3) with Eqs. (8) and (9) gives the complete GL-GraphFormerI.

GL-GraphFormerII: High-order Information through Graph Pretraining
In section 3.2, we introduce two kinds of encoding that bring the first- and second-order global graph information into the local Transformer. In this section, we explore whether higher-order global graph information can bring further benefits. However, higher-order information of the global graph is less intuitive than the first- and second-order information, and it is hard to come up with a natural encoding method like those in section 3.2. To tackle this problem, inspired by the power of Graph Neural Networks (GNNs) in capturing the high-order interactions among nodes, we utilize a GNN pretrained on the global graph to equip the item embeddings with higher-order knowledge. For simplicity, we adopt the efficient and effective LightGCN [6] structure for collaborative filtering as the pretrained knowledge extractor. Next, we describe the details of the adopted GNN and how we utilize it to extract the higher-order knowledge.

Message Passing.
The only parameters of LightGCN are the embeddings of the users and items, and it relies solely on linear message passing to obtain the higher-order interactions between different nodes. Recall that the bipartite graph is $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where the nodes $\mathcal{V}$ contain the user set $\mathcal{U}$ and the item set $\mathcal{I}$. For user $u$ in $\mathcal{U}$ and item $i$ in $\mathcal{I}$, their embeddings are $\mathbf{e}_u^0$ and $\mathbf{e}_i^0$, respectively, with dimension $d$. Then, in the $l$-th layer of LightGCN, the model conducts linear message passing as follows,

$$\mathbf{e}_u^{l} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{\deg(u)}\sqrt{\deg(i)}}\, \mathbf{e}_i^{l-1}, \qquad \mathbf{e}_i^{l} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{\deg(i)}\sqrt{\deg(u)}}\, \mathbf{e}_u^{l-1},$$

where $\mathcal{N}_u$ and $\mathcal{N}_i$ denote the neighbors of user $u$ and item $i$ in G, and $\deg(\cdot)$ is the same as our previous notation for the degree of a node. In each layer of LightGCN, the embedding of each user is updated by a linear combination of its neighboring item embeddings, and symmetrically for the embedding of each item. The degrees of the items and users are used for normalization to prevent the embedding scale from exploding.
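The linear message passing described above can be sketched as a single propagation step in pure Python. This is a minimal illustration under stated assumptions, not the LightGCN reference implementation: the function name `lightgcn_layer` is hypothetical, embeddings are stored in dictionaries of lists, and the edge list is assumed to contain each (user, item) pair at most once.

```python
import math
from collections import Counter

def lightgcn_layer(user_emb, item_emb, edges):
    """One step of symmetric-normalized linear message passing.

    Each user receives the normalized sum of its neighboring item embeddings
    and vice versa; there are no weight matrices and no non-linearity.
    Nodes without any edge end up with a zero vector.
    """
    dim = len(next(iter(user_emb.values())))
    deg_u = Counter(u for u, _ in edges)
    deg_i = Counter(i for _, i in edges)
    new_user = {u: [0.0] * dim for u in user_emb}
    new_item = {i: [0.0] * dim for i in item_emb}
    for u, i in edges:
        norm = 1.0 / math.sqrt(deg_u[u] * deg_i[i])
        for k in range(dim):
            new_user[u][k] += norm * item_emb[i][k]
            new_item[i][k] += norm * user_emb[u][k]
    return new_user, new_item
```

Calling the function repeatedly on its own output yields the embeddings of successive layers, each of which aggregates information from one hop further away in the bipartite graph.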

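Model Pretraining.
The embeddings of all layers are averaged so that the final embedding contains different orders of information. With $L$ denoting the number of LightGCN layers, the final embeddings for user $u$ and item $i$ are obtained as follows,

$$\mathbf{e}_u = \frac{1}{L+1} \sum_{l=0}^{L} \mathbf{e}_u^{l}, \qquad \mathbf{e}_i = \frac{1}{L+1} \sum_{l=0}^{L} \mathbf{e}_i^{l}.$$

Then, whether user $u$ will interact with item $i$ in the global graph G is scored with the inner product $\hat{y}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i$. The LightGCN parameters are pretrained by optimizing the following Bayesian Personalized Ranking (BPR) loss [19],

$$\mathcal{L}_{BPR} = -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{N}_u} \sum_{j \notin \mathcal{N}_u} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) + \lambda \lVert \mathbf{E}^0 \rVert^2,$$

where $\sigma$ is the sigmoid function, $\mathbf{E}^0$ collects the layer-0 embeddings of all users and items, and $\lambda$ is the coefficient for the L2 regularizer. After training on the graph of the training dataset, we obtain the embedding for each item. Replacing $e^u_t$ with the pretrained item embedding gives the GL-GraphFormerII. In our experiments, we utilize the pretrained $\mathbf{e}_i^0$ for item $i$ in the historical sequence, where $\mathbf{e}_i^0$ implicitly contains the higher-order information through the back-propagation process. We also provide the results of utilizing the pooled embedding $\mathbf{e}_i$ in section 4.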
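The layer averaging and the BPR objective used for pretraining can be sketched for a single (user, positive item, negative item) triple as follows. The helper names are hypothetical, the regularizer is applied only to the three embeddings passed in (a per-triple simplification of the L2 term, which is an assumption of this sketch), and the log-sigmoid is computed through a numerically stable softplus.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def average_layers(layer_embs):
    """Average one node's embeddings over all layers, layer 0 included."""
    return [sum(vals) / len(layer_embs) for vals in zip(*layer_embs)]

def bpr_loss(user_vec, pos_vec, neg_vec, reg=0.0):
    """BPR loss for one triple: -ln sigmoid(score_pos - score_neg) + L2 penalty."""
    diff = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
    ranking = softplus(-diff)  # equals -ln(sigmoid(diff))
    penalty = reg * (dot(user_vec, user_vec) + dot(pos_vec, pos_vec) + dot(neg_vec, neg_vec))
    return ranking + penalty
```

The loss shrinks as the observed item is scored increasingly higher than the sampled unobserved item, which is the pairwise ranking behavior BPR is designed to encourage.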

GL-GraphFormer Training
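In this section, we describe how we train the GL-GraphFormer for sequential recommendation. With the GL-GraphFormer, we can obtain the intention $s^u_t$ of user $u$ at time step $t$ from their historical sequence (Eq. (6)). Based on $s^u_t$, we predict the user's preference towards an item $j$, $r^u_{t,j}$, with their inner product,

$$r^u_{t,j} = (s^u_t)^\top e_j,$$

where $e_j$ is the embedding for item $j$. We adopt the same prediction loss as [11],

$$\mathcal{L} = -\sum_{u \in \mathcal{U}} \sum_{t=1}^{T} \left[ \log \sigma(r^u_{t,o_t}) + \log\left(1 - \sigma(r^u_{t,j})\right) \right],$$

where $o_t$ is the ground truth item and we follow [11] to sample one negative item $j$ for each time step for user $u$.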
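The training objective for one time step can be sketched as a binary cross-entropy over one ground-truth item and one sampled negative item, in the style of [11]. The function names are hypothetical, items are assumed to be integer ids in `range(num_items)`, and negatives are drawn uniformly from items the user has not interacted with; the exact sampling scheme is an assumption of this sketch.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_negative(num_items, seen, rng):
    """Uniformly sample an item id that the user has not interacted with."""
    if len(seen) >= num_items:
        raise ValueError("no unseen item left to sample")
    while True:
        candidate = rng.randrange(num_items)
        if candidate not in seen:
            return candidate

def step_loss(intent, pos_emb, neg_emb):
    """-[log sigmoid(r_pos) + log(1 - sigmoid(r_neg))] for one time step."""
    r_pos = dot(intent, pos_emb)
    r_neg = dot(intent, neg_emb)
    return softplus(-r_pos) + softplus(r_neg)
```

Summing `step_loss` over all time steps and users gives the full objective; the loss is small when the intention vector scores the ground-truth item high and the sampled negative item low.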

EXPERIMENTS
In this section, we present the experimental setup and results.
Datasets. We conduct experiments on 5 recommendation datasets. Two of them are from Amazon [5], named Amazon-Beauty and Amazon-Games. Two of them are from MovieLens [4], named MovieLens-1M and MovieLens-10M. The last one is Steam [17]. These datasets have different scales and sparsity. The statistics of the datasets are shown in table 1. We split the datasets as in previous works [11,16,21], i.e., the last item of each user's sequence is used for testing, the second-to-last item for validation, and the remaining data for training.
Implementation Details. We implement our model based on the basic Transformer structure, the same as SASRec [11], the earliest work utilizing the Transformer for sequential recommendation. Many later works [1,16,21] add more structures on top of SASRec, so an improvement on the basic model is more meaningful. The number of Transformer blocks is 2, and the hidden dimension d is searched from {10, 20, 30, 40, 50} using the basic Transformer model (SASRec). The optimizer is Adam [12] with learning rate 1e-3. The batch size is 128, and the dropout rate is 0.5 for the MovieLens datasets and 0.2 for the other datasets. The pretrained LightGCN model is a 3-layer model, as recommended by the original paper, and is optimized with the BPR loss. Its optimizer is Adam with learning rate 1e-3, its batch size is 2048, and it is trained for 400 epochs. Additionally, when utilizing the pretrained embeddings to initialize the item embeddings of the historical sequence, we scale each embedding by dividing it by a factor of 5.0. We do so because the pretrained embeddings from LightGCN have a large scale, and directly using them for initialization makes convergence difficult.
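The data split and the rescaling of the pretrained embeddings described above can be sketched as follows. The function names are hypothetical, and raising an error for sequences shorter than three items is an assumption of this sketch, since the text does not say how such short sequences are handled.

```python
def leave_one_out(sequence):
    """Split one user's chronological sequence into (train, validation, test).

    The last item is held out for testing and the second-to-last item for
    validation; everything before that is used for training.
    """
    if len(sequence) < 3:
        raise ValueError("need at least 3 interactions to hold out two items")
    return sequence[:-2], sequence[-2], sequence[-1]

def scale_pretrained(embedding, factor=5.0):
    """Divide a pretrained embedding by a constant before using it as initialization."""
    return [value / factor for value in embedding]
```

For instance, `leave_one_out([1, 2, 3, 4, 5])` yields `([1, 2, 3], 4, 5)`, and `scale_pretrained([5.0, -10.0])` yields `[1.0, -2.0]`.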
Recommendation Performance. We evaluate the models in terms of Recall@10 and normalized discounted cumulative gain (NDCG@10), two widely adopted metrics in recommendation. Higher values in both metrics indicate better performance. Different from SASRec, we do not calculate these metrics on sampled negative items; instead, we calculate them over all the items, which is more accurate, as indicated in previous work [18].
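With a single ground-truth item per user and ranking over all items, both metrics reduce to simple functions of the ground-truth item's rank. The following pure-Python sketch uses hypothetical function names and resolves score ties in favor of the ground-truth item, which is an assumption; whether previously seen items are excluded from the candidate set is left to the caller.

```python
import math

def rank_of_target(scores, target):
    """0-based rank of `target` among all scored items.

    `scores` maps every candidate item to its predicted score; the rank is
    the number of other items scored strictly higher than the target.
    """
    target_score = scores[target]
    return sum(1 for item, s in scores.items() if item != target and s > target_score)

def recall_at_k(rank, k=10):
    """1.0 if the single ground-truth item is within the top k, else 0.0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    """NDCG with one relevant item: 1 / log2(rank + 2) inside the top k, else 0.0."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0
```

Averaging these per-user values over all test users gives the dataset-level Recall@10 and NDCG@10.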
The experimental results are shown in table 2. We report the performance of the basic Transformer-based sequential recommender, SASRec, and of LightGCN, the model we utilize for pretraining. GL-GFormerI and GL-GFormerII denote the two Global-Local GraphFormers in section 3.2 and section 3.3. Besides these models, we also explore variants with different combinations of the proposed first-, second-, and higher-order information.
From the results in table 2, we can see that the sequential recommender generally performs better than the collaborative filtering model in sequential recommendation (except on the Beauty dataset). More importantly, we make the following observations:
• Incorporating the global graph information into the local Transformer brings benefits. Comparing the SASRec baseline with the two kinds of GL-GraphFormer, we can see that on the Amazon-Beauty, Amazon-Games, and Steam datasets, GL-GFormerII achieves significant improvement over SASRec. On the two MovieLens datasets, GL-GFormerI brings substantial improvement in recommendation accuracy.
• The benefits of the global graph information to the Transformer are highly related to the dataset sparsity. We can see that GL-GFormerI performs better on the dense MovieLens-1M and MovieLens-10M datasets, while GL-GFormerII performs better on the sparser Amazon-Beauty, Amazon-Games, and Steam datasets. More specifically, from the results of the variants "Base+X" (X = 1, 2, 1+P, etc.), we can see that the first-order encoding (item popularity encoding) does not play a significant role on any of the datasets. The second-order encoding brings the best performance on the MovieLens datasets, and the combination of the second-order encoding and pretraining gives the best performance on the sparse datasets. These phenomena indicate that sparser datasets need higher-order global graph information. This observation is not hard to understand: when the dataset becomes sparse, the common items between two users become fewer, so more of the valuable relations between items come from higher-order information.
Effects of Different Initialization. In our previous experiments for GL-GraphFormerII, we utilize the 0-th layer embedding from LightGCN because the back-propagation process makes the item embeddings in the 0-th layer aware of their higher-order neighbors. We also provide the results of utilizing the average-pooled embedding over the different LightGCN layers as initialization. The results are shown in table 3. We observe that utilizing the pooled embedding shows comparable performance on the Beauty and Steam datasets, but achieves clear improvement on the Games dataset, because the average pooling explicitly contains different orders of graph information.
Visualization. We provide a Top-1 recommendation example of a one-layer GL-GraphFormer and the original one-layer SASRec. The ground truth is within the Top-10 of the GL-GFormer recommendation but outside the Top-10 of the SASRec recommendation. The SASRec baseline wrongly focuses on the shampoo and gives the wrong match, indicating that the attention mechanism and the learned item embeddings in SASRec are not good enough. In contrast, our method gives more reasonable attention to facial cosmetics and predicts the user's correct moisturizing intention.

CONCLUSION
In this paper, we present a novel and promising perspective: utilizing the bipartite user-item global graph information to enhance Transformer-based sequential recommendation. We offer two Global-Local GraphFormers, which effectively improve the performance of the Transformer model for sequential recommendation. This work also finds that sparser datasets benefit from higher-order global graph information. Further investigations, including exploring more ways of utilizing the global graph information for sequential recommendation and designing more advanced models based on the GL-GraphFormer, are promising future directions.

Figure 1: The overall framework of the GL-GraphFormer for sequential recommendation. We explore two ways to combine the global graph information with the Transformer-based encoder. One is to add the first- and second-order encodings to the item and relation embeddings. The other is to utilize a GNN to capture higher-order item interactions and initialize the item embeddings with the pretrained GNN embeddings.

Figure 2: Attention in GL-GraphFormer and SASRec.

Table 2: Model performance. The variants of the GL-GraphFormer are denoted as follows: Base means the original Transformer (SASRec) for sequential recommendation, 1 means adding the first-order global graph information (popularity encoding), 2 means adding the second-order common user encoding, and P means using the pretrained embedding from LightGCN. Pretraining LightGCN on MovieLens-10M has a high computational cost because of the huge number of interactions, and we report it as out of time (OOT).