User Behavior Enriched Temporal Knowledge Graphs for Sequential Recommendation

Knowledge Graphs (KGs) enhance recommendations by providing external connectivity between items. However, there is limited research on distilling relevant knowledge in sequential recommendation, where item connections can change over time. To address this, we introduce the Temporal Knowledge Graph (TKG), which incorporates such dynamic features of user behaviors into the original KG while emphasizing sequential relationships. The TKG captures patterns of both entity dynamics (nodes) and structural dynamics (edges). Considering real-world applications with large-scale and rapidly evolving user behavior patterns, we propose an efficient two-phase framework called TKG-SRec, which strengthens Sequential Recommendation with Temporal KGs. In the first phase, we learn dynamic entity embeddings using our novel Knowledge Evolution Network (KEN), which brings together pretrained static knowledge with evolving temporal knowledge. In the second phase, downstream sequential recommender models utilize these time-specific dynamic entity embeddings with compatible neural backbones such as GRUs, Transformers, and MLPs. In extensive experiments over four datasets, TKG-SRec outperforms the current state-of-the-art by a statistically significant 5% on average. Detailed analysis validates that such filtered temporal knowledge better adapts entity embeddings for sequential recommendation. In summary, TKG-SRec provides an effective and efficient approach.


INTRODUCTION
Sequential Recommendation (SR) models how user preferences change across temporally-ordered item sequences, evolving from the use of Markov Chains [9] and Recurrent Neural Networks [8,13,21] to Transformers [18,34,46]. In SR tasks, knowledge graphs (KGs) serve as valuable auxiliary resources, providing additional information about items through external relationships to enhance recommendation accuracy and diversity [61]. However, KGs contain massive amounts of knowledge (both entities and relationships) that is irrelevant to specific recommendation tasks, weakening performance.
Distilling relevant knowledge is therefore necessary [6,49], and existing methods fall into two categories: item-centric and user-centric. Item-centric methods disregard the role of users in knowledge filtering and focus only on filtering by item relationships, e.g., RippleNet [41] and KGAT [49]. In contrast, user-centric methods select adjacent entities favoring user-specific preferences, as demonstrated in KGNN-LS [43], CKAN [50], and KGCN [45]. However, both item-centric and user-centric methods have limitations for SR.
(1) The item-centric approach considers only those KG entities statically linked to an item as relevant [17], overlooking entities derived from sequential user behaviors. However, sequential relevance is a strong signal in SR tasks. For example, in Figure 1, while Apple Watch and iPhone are closely connected in the KG, item-centric methods symmetrically treat them as relevant to each other. However, user behavior often shows an asymmetric sequence, such as buying an iPhone before an Apple Watch. Ignoring this sequence can lead to irrelevant recommendations: a user who first buys an Apple Watch may not find an iPhone immediately relevant. This motivates the need for SR methods that account for asymmetric sequential relationships in knowledge distillation.
(2) Existing user-centric methods rely on static user preferences for identifying relevant entities [43]. However, this approach falls short in SR contexts where user preferences evolve over time. Consider the example (Figure 1, right) regarding the proposition of "buying a digital watch after buying a mobile phone": a mere 6% of Apple Watches were sold for every iPhone sold in 2016, but this figure rose markedly to 22% in 2020 [2], clearly illustrating that relationships among entities change over time. While both items are connected in the original KG because they are supplied by Apple (static knowledge), their connectivity varies (temporal knowledge). In 2016, the connectivity between these two nodes was less relevant; for users in 2020, it became more relevant for identifying user intentions. Hence, incorporating temporal knowledge is crucial for capturing these evolving relationships.
To tackle the above issues, we propose a behavior-centric approach, which introduces time-aware statistics of all user behaviors into the existing KG, improving relevant knowledge identification. Specifically, we incorporate sequential relationships as constraints and emphasize the temporal dynamics of nodes and edges, which wax and wane in prominence. We term this time-aware, behavior-inclusive KG a Temporal Knowledge Graph (TKG).
How to construct the TKG? In real-world applications, the construction of temporal knowledge poses scalability issues when tracking each user's behavior at every timestamp. Including precise timestamps leads to untenable computational complexity, and the distinct and varying levels of relevant knowledge across individual users make the creation of personalized TKGs cost-prohibitive. In our work, rather than treating time as a continuous feature, we break it down into discrete periods and monitor user behaviors during time intervals. We regard the information of nodes and edges in the original KG as static knowledge, and we use statistical features of user behaviors in each time period to select relevant temporal knowledge. As prior research [63] has affirmed that popularity bias has a beneficial impact on overall recommendation accuracy, we choose popularity as the statistical feature [14]: item popularity serves as the node feature, and popular item-item transitions serve as the dynamic relations. Correspondingly, these two types of temporal knowledge capture the dynamicity exhibited in either the entities (i.e., nodes) or the structure (i.e., edges).
How to model the TKG? Existing TKG modeling methods primarily operate on small datasets [4], making them inapplicable in real-world recommendation scenarios, which must deal with large volumes of data. Additionally, they often neglect the evolving nature of real-world knowledge graphs [4], which continually adapt to reflect changing user-item interactions. In our work, given the graphical nature of KGs, using a graph neural network to model them is a natural strategy [62]. Considering the vast scale of our TKG (containing millions of relations), it is not feasible to train on it all at once. Therefore, we adopt pretraining, disentangling the modeling of static knowledge and temporal knowledge and processing them separately using distinct GCNs. Specifically, we employ a large pretrained model to capture static knowledge, while smaller specialized models capture temporal knowledge within their respective time frames (here, the terms small and large indicate the volume of processed knowledge). And as real-world knowledge is continuously updated, we utilize selection gates to effectively merge temporal knowledge from different views into the static entity representation in an iterative fashion. We store the embeddings of each entity at every time point, allowing rapid iteration and training in realistic scenarios that demand efficiency.
We refer to the aforementioned TKG modeling method as the Knowledge Evolution Network (KEN). It acts as an effective technique to distill relevant information given static and temporal knowledge from each time frame, providing better support for downstream SR tasks. We call our framework of TKG construction, TKG modeling, and downstream sequential modeling the Temporal Knowledge Graph enhanced Sequential Recommender (TKG-SRec). Extensive experimental results demonstrate that TKG-SRec establishes a new state-of-the-art for SR. Through a detailed study of our dynamic knowledge graph learning approach, we find that our modeling of dynamic user behavior filters many irrelevant relations from the knowledge graph. We also show that TKG-SRec is compatible with various widely adopted sequential modeling methods: SASRec [18] and FMLP-Rec [66]. Such broad compatibility lends additional evidence that dynamic KG learning is a fruitful area for future work.
The main contributions of our work are summarized as follows: (i) To the best of our knowledge, we are the first to explore the temporal dynamicity of knowledge graphs in SR. We propose the TKG-SRec framework, which exploits temporal knowledge to improve SR. (ii) We contribute KEN as the key component of our framework, which models the TKG in an efficient manner. (iii) We conduct extensive experiments on four benchmark datasets, showing that TKG-SRec consistently outperforms state-of-the-art methods. We validate the effectiveness of KEN by integrating it with GRU-, MLP-, and Transformer-based instantiations.

PROBLEM FORMULATION
The objective of sequential recommendation is to predict the next consumed item v ∈ I for each user u ∈ U, given his/her history sequence S_u. Here, I and U are the item set of size |I| and the user set of size |U|. In a history sequence of length n, each element is a pair of an interacted item and its corresponding interaction time, denoted as S_u = (i_1, t_1), ..., (i_n, t_n). For brevity, the superscript u will be dropped from the notation in the following. In this work, we propose to enhance sequential recommendation by further considering information from a temporal knowledge graph, denoted as G = (V, R, E, E_T, X_T). It links items directly or indirectly as an additional source of information. The node set V encompasses not only the set of items I (where each item can be viewed as a special type of entity e ∈ V) but also the set of non-item entities. More precisely, G comprises:
• Static knowledge. The intrinsic edges E between these entities, known as static relations, are denoted by triples (e_s, r, e_o). Here, e_s is the subject entity, e_o is the object entity, and r ∈ R is their relation. In our work, we preserve the original entity nodes and their static relations (such as "A is the director of B") in the knowledge graph as static knowledge.
• Temporal knowledge. There exist various temporal relations between entities, which are represented as quadruple-set edges E_T: (e_s, r_t, t, e_o). Here, the variable t denotes the index of the time frame, and the granularity of these time frames is determined based on the specific needs of the task. The notation r_t ∈ R corresponds to a particular type of time-specific relation. Moreover, we also consider the temporal properties of entity nodes, denoted as X_T: (e, p, t), indicating that entity e holds property p at time t.

METHODOLOGY
Our methodology is founded on constructing temporal knowledge from user behavior (§ 3.1). We then introduce the TKG-SRec framework, which operates in two phases: entity embedding learning on the TKG (Phase-1, § 3.2 and § 3.3) and sequential modeling (Phase-2, § 3.4). In Phase-1, entity embeddings are first pretrained using static knowledge and then refined with temporal knowledge. Phase-2 uses a sequential model to predict user preferences with the dynamic entity embeddings.

Temporal Knowledge Construction
For effective modeling of temporal knowledge, we divide it into time-indexed snapshots (Figure 2) from two views. The two views (series of snapshots) encapsulate time-specific entity properties and temporal relations, modeled using statistical characteristics derived from user-item behaviors within specific time windows. We only use partitioned training records to construct TKGs, preventing data leakage. Details of constructing the two views follow.
⋄ Entity-dynamic view. In this work, we treat popularity statistics as properties of all entities, including items and non-items. To factor in popularity's temporal aspect, we introduce the entity-dynamic view of the graph through a series of snapshots G^E: {G^E_1, ..., G^E_T}. Each snapshot G^E_t captures entity properties (e, p, t') at time t' = t. Here, p signifies whether an entity is popular, determined by the top-ρ frequent item entities derived from user-item interactions in each timeframe. The ratio ρ distills popular items from the long-tail popularity distribution. For seamless popularity propagation, in each snapshot we include the k-hop neighbors of popular entities, including non-item ones. For instance, assuming the time interval is a year, if the iPhone was popular in 2019, its k-hop entities (e.g., Apple Inc. for k = 1) and their relation (e.g., "is produced by") are recorded in G^E_2019, highlighting both item and non-item entities that garnered user attention in 2019.
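The entity-dynamic view above can be sketched as follows. This is a minimal illustration with our own function name and a simple top-ρ cut-off; the actual pipeline operates on much larger KGs, and the popularity ratio and hop count are hyperparameters:

```python
from collections import Counter

def entity_dynamic_snapshots(interactions, kg_edges, num_windows, rho, k_hops=1):
    """Build one entity-dynamic snapshot per time window.

    interactions: list of (user, item, window) with window in [0, num_windows).
    kg_edges: list of (head, relation, tail) static KG triples.
    rho: fraction of most-frequent items kept as "popular" per window.
    """
    # Undirected adjacency over the static KG, used for k-hop expansion.
    adj = {}
    for h, _, tl in kg_edges:
        adj.setdefault(h, set()).add(tl)
        adj.setdefault(tl, set()).add(h)

    snapshots = []
    for w in range(num_windows):
        counts = Counter(i for _, i, win in interactions if win == w)
        top_n = max(1, int(rho * len(counts))) if counts else 0
        popular = {i for i, _ in counts.most_common(top_n)}
        # Expand with k-hop neighbors so popularity reaches non-item entities.
        frontier, nodes = set(popular), set(popular)
        for _ in range(k_hops):
            frontier = {n for f in frontier for n in adj.get(f, ())} - nodes
            nodes |= frontier
        snapshots.append(nodes)
    return snapshots
```

Given interactions where the iPhone dominates a window and a static edge (iPhone, "is produced by", Apple Inc.), the window's snapshot contains both the popular item and its 1-hop producer entity.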
⋄ Structure-dynamic view. Leveraging existing designs [7,53,57], we capture item-to-item transitions based on frequency, representing sequential dependencies within TKGs. These transitions lead to a series of snapshots, G^S: {G^S_1, ..., G^S_T}, documenting fine-grained relations between item entities over time. Each snapshot G^S_t depicts a directed mini graph with edges symbolizing interest transition relations (i_s, r_tr, t, i_o). In these snapshots, item entities can be either the source or the target of the transition relations. Focusing on first-order transitions between adjacent items in user sequences, denoted as r_tr, we include transition relations between item pairs that exceed the frequency threshold θ within the t-th timeframe. For example, if there is a frequent pattern of users buying AirPods right after (i.e., first-order) an iPhone in 2020, surpassing θ occurrences, we include this directed link (iPhone, r_tr, 2020, AirPods) in the snapshot G^S_2020. In summary, for temporal knowledge modeling (as depicted in Figure 2), we introduce time-aware properties of entities into G^E_t and a new type of relation r_tr (referred to as interest transition) into G^S_t. These properties and relations are derived from user behavior statistics within the time frame t, serving as hard constraints for distilling time-specific relevant knowledge. It is worth noting that in the previous examples, we used a year as the time interval (t-1, t) for easier understanding. However, the time window length can be set flexibly; as the time window expands, the granularity of the partitioned snapshots becomes coarser. The total time span and the number of partitioned time windows T can vary.
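The transition counting described above can be sketched as follows, assuming interactions are already bucketed into window indices (the function name and the placeholder relation label `r_tr` are illustrative):

```python
from collections import Counter, defaultdict

def transition_snapshots(user_sequences, num_windows, theta):
    """Build one structure-dynamic snapshot per time window.

    user_sequences: dict mapping user -> time-ordered list of (item, window).
    A directed quadruple (a, "r_tr", w, b) is kept when the first-order
    transition a -> b occurs more than theta times inside window w.
    """
    counts = defaultdict(Counter)  # window -> Counter over (src, dst) pairs
    for seq in user_sequences.values():
        for (a, wa), (b, wb) in zip(seq, seq[1:]):
            if wa == wb:  # adjacent items consumed within the same window
                counts[wb][(a, b)] += 1
    return [
        {(a, "r_tr", w, b) for (a, b), c in counts[w].items() if c > theta}
        for w in range(num_windows)
    ]
```

Note the directionality: iPhone → AirPods occurring often does not imply the reverse edge is included, which is exactly the asymmetric sequential signal the TKG is designed to keep.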

Static Pretraining on TKG
The static knowledge denotes the fixed relationships E between entities. Given its high connectivity and computational demand in real scenarios, we use a structurally simple graph encoder (i.e., the static encoder) for its modeling. This encoder takes d-dimensional entity embeddings E^S ∈ R^{N×d} (randomly initialized) as input, processing them with a graph neural network. Designed for modeling varied relations in knowledge graphs, its propagation function for each entity node v is:

h_v^{(l+1)} = σ( W_0^{(l)} h_v^{(l)} + Σ_{r∈R} Σ_{v'∈N_v^r} (1/|N_v^r|) W_r^{(l)} h_{v'}^{(l)} )

Here, W_0^{(l)} ∈ R^{d×d} is a self-loop transformation matrix; W_r^{(l)} ∈ R^{d×d} is the transformation matrix w.r.t. each type of relation r linking central node v and neighbor node v'. The nodes' hidden representation h_v^{(0)} := E^S_v is initialized from the entity embedding table. Through L-layer propagation from l = 0 to l = L, we take the hidden state of the final layer as the encoded representation of static knowledge, denoted as the static hidden state h^S_v. During static pretraining, we apply the scoring function of DistMult factorization [58], f(e_s, r, e_o) = h_{e_s}^T R_r h_{e_o}, where R_r ∈ R^{d×d} is a diagonal matrix to be learned that bilinearly interacts with the hidden vectors of entities e_s and e_o. For triples (e_s, r, e_o) belonging to E, we label them as positive samples with y = 1. For each positive sample, we follow previous practice [36], sampling negative edges that do not occur in the graph: we randomly corrupt either the subject or the object entity, marking the corrupted triple with label y = 0. The pretraining objective is to optimize the cross-entropy loss over these labeled triples.
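To make the DistMult decoder concrete, here is a toy NumPy sketch of the scoring function and the cross-entropy objective over labeled triples. The embeddings here are random placeholders standing in for the encoder's output, and the function names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_entities, n_relations = 8, 5, 2
E = rng.normal(scale=0.1, size=(n_entities, d))   # entity embeddings (placeholder)
R = rng.normal(scale=0.1, size=(n_relations, d))  # diagonal of each relation matrix R_r

def distmult_score(s, r, o):
    # Bilinear score h_s^T diag(R_r) h_o, as in DistMult factorization.
    return float(E[s] @ (R[r] * E[o]))

def bce_loss(triples, labels):
    # Cross-entropy over positive (y=1) and corrupted negative (y=0) triples.
    eps = 1e-9
    loss = 0.0
    for (s, r, o), y in zip(triples, labels):
        p = 1.0 / (1.0 + np.exp(-distmult_score(s, r, o)))
        loss -= y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
    return loss / len(triples)
```

One well-known property visible in the sketch: DistMult's diagonal bilinear form is symmetric in subject and object, which is one reason the paper's TKG adds an explicitly directed transition relation on top of the static scores.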

Temporal Tuning on TKG
In addition to static knowledge, we leverage temporal fine-tuning to integrate temporal knowledge. Following [52], we establish dynamic entity embeddings E^T ∈ R^{T×N×d} to effectively represent the evolution of entities and relations, storing unique parameters for each time frame. Each time frame's entity embedding is initialized from the static entity embeddings, E^T_t = E^S. As shown in Figure 3 (b), a temporal encoder processes the embedding E^T_t at each time frame t, yielding a temporal hidden state. We use gated knowledge evolution units to ensure dynamic continuity between temporal hidden states, while the temporal decoder generates predictions for the next time frame t + 1 based on the temporal knowledge.

Temporal Encoder.
Compared to static knowledge, temporal knowledge in each time frame is more streamlined, allowing for detailed modeling of the two dynamics. Given the dual perspectives of temporal knowledge, we introduce the temporal encoder to selectively propagate through them. Entities with popular neighbors are likely to have popular traits, and transition relations between entities foster new item connections. Both factors play a crucial role in determining the probability of being relevant knowledge for predicting user preferences. In G^E_t, popular property perceptions are relayed via the original relations, while G^S_t propagates transition information using the new relation. Since their modeling objectives differ, we employ differentiated aggregation for them. Specifically, the propagation function for node v at time t is:

h_{v,t}^{(l+1)} = RReLU( g_{v,t} * (1/|N^E_{v,t}|) Σ_{v'∈N^E_{v,t}} W_r h_{v',t}^{(l)} + (1 - g_{v,t}) * (1/|N^S_{v,t}|) Σ_{v'∈N^S_{v,t}} W_tr (h_{v',t}^{(l)} + r_tr) )

N^E_{v,t} and N^S_{v,t} denote the object neighbors linked to node v through relations r and r_tr in G^E_t and G^S_t, respectively. h_{v'} + r_tr conveys the translational relation [3] from neighbor v' via edge r_tr. The representation vector of r_tr is shared across snapshots, and it mainly governs the effect of item transitions in entity temporal fine-tuning. To sidestep the dying ReLU issue [24], we adopt RReLU [55] as our activation function. The * symbolizes the Hadamard product, while g_{v,t} ∈ R^{1×d}, a selection gate, regulates how node v is influenced by the two views at time t. It is further detailed as:

g_{v,t} = σ( W_g [h̄^E_{v,t} ; h̄^S_{v,t}] + b_g )

where W_g is the gate weight matrix for the selection between the two views, and b_g is the bias. [;] is the concatenation operation, and h̄ represents the normalized average hidden vectors of v's neighbors connected through relation r or r_tr. Similar to the static encoder, the temporal encoder outputs the temporal hidden state h^T_{v,t} = h_{v,t}^{(L')} after L'-layer propagation. We separate the two views in relation propagation, aiming to achieve better decoding effects corresponding to the entity classification and relation prediction tasks, respectively.
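The selection-gate idea can be illustrated with a small NumPy sketch. The shapes and the pre-averaged per-view messages are simplifying assumptions of ours; the full encoder also applies relation-specific transforms and the RReLU activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_two_view_message(h_ent_view, h_struct_view, W_g, b_g):
    """Fuse a node's aggregated messages from the entity-dynamic and
    structure-dynamic views with a learned selection gate.

    h_ent_view, h_struct_view: (d,) aggregated neighbor messages per view.
    W_g: (d, 2d) gate weights over the concatenated views; b_g: (d,) bias.
    """
    g = sigmoid(W_g @ np.concatenate([h_ent_view, h_struct_view]) + b_g)
    # g close to 1 favors the entity-dynamic view, close to 0 the structure view.
    return g * h_ent_view + (1.0 - g) * h_struct_view
```

With a strongly positive bias the gate saturates toward the entity-dynamic view, so the structure-dynamic message is effectively filtered out for that node and time.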

Gated Knowledge Evolution Unit.
In order to leverage both the dynamic and static properties of each entity, we take the static hidden state h^S_v as the long-term constraint and the temporal hidden state h^T_{v,t} as the short-term information filter in our evolution units. Specifically, as illustrated in the green box in Figure 3 (b), in the t-th unit cell, the update for each entity from h_{v,t-1} → h_{v,t} is:

h_{v,t} = h^S_v + β_{v,t} * h_{v,t-1} + (1 - β_{v,t}) * h̃_{v,t}    (5)
β_{v,t} = σ( W_f [h_{v,t-1} ; h̃_{v,t}] + b_f )    (6)
h̃_{v,t} = h^T_{v,t} + p_t    (7)

In Equation 5, the evolution representation vector h_{v,t} ∈ R^d combines three components: the static representation h^S_v, the previous time frame's representation h_{v,t-1}, and the current frame's representation h̃_{v,t}. The fusion vector β_{v,t} ∈ R^d integrates h_{v,t-1} and h̃_{v,t}, balancing long-term and short-term properties through the element-wise product *. The fusion vector is activated by the gate weight matrix W_f and bias term b_f ∈ R^d in Equation 6. The long-term static representation serves as a connection that maintains awareness of the static knowledge within the evolution representation. The evolved representation h_{v,0} is initialized from the static entity embedding table. The absolute time information t is incorporated into the temporal hidden states via the time positional embedding p_t in Equation 7, which is initialized from a separate positional embedding table, following the method outlined in [18].
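A toy sketch of one evolution-unit update follows; the parameter shapes are our assumptions, and in the real framework these weights are trained jointly with the encoders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evolution_unit(h_static, h_prev, h_temporal, p_t, W_f, b_f):
    """One gated knowledge-evolution step for a single entity.

    h_static:   long-term static representation of the entity, shape (d,).
    h_prev:     evolved representation from the previous time frame, (d,).
    h_temporal: temporal hidden state at the current frame, (d,).
    p_t:        time positional embedding injecting absolute time, (d,).
    W_f: (d, 2d) fusion gate weights; b_f: (d,) bias.
    """
    h_cur = h_temporal + p_t                                      # add time position
    beta = sigmoid(W_f @ np.concatenate([h_prev, h_cur]) + b_f)   # fusion gate
    # Static anchor plus a gated mix of long-term and short-term signals.
    return h_static + beta * h_prev + (1.0 - beta) * h_cur
```

When the gate saturates toward the previous frame (beta near 1), the update reduces to the static anchor plus the carried-over history, so abrupt per-frame noise is suppressed while the static knowledge always remains in view.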
The entity evolution representation is updated iteratively from h_{v,0} to h_{v,T} and used for decoding and predicting the time-aware properties in the snapshots G^E_{t+1} and G^S_{t+1}, as detailed next. The entity classification task aims to classify the properties (i.e., the popularity) of entities in the next time frame G^E_{t+1}. To accomplish this, we utilize the evolved representations h_{v,t} from t = 0 to t = T in the gated knowledge evolution units, along with an MLP decoder M for decoding. The probability of an entity being popular at time t + 1 is calculated as p(e, t+1) = σ(M(h_{e,t})). For training, we create positive-negative pairs where (e^+, p, t) ∈ X_T represents positive entities e^+ with the popularity property. We also randomly sample negative entities e^- to represent unpopular ones for pairwise training. The entity classification objective is optimized using the BPR loss [27]. The relation prediction task involves determining the existence of an interest transition relation in G^S_{t+1}. To address this, we use ConvTransE [32] as our relation decoder C, following the common practice of using graph decoders as score functions for relation prediction. We decode using the evolved representations h_{v,t}. The probability of a quadruple (i_s, r_tr, t+1, i_o) existing is calculated as p(i_s, r_tr, i_o, t+1) = σ(r_tr · C(h_{i_s,t}, h_{i_o,t})), where r_tr represents the edge embedding. For training, positive quadruples (i_s, r_tr, t, i_o) ∈ E_T are extracted from the TKG and marked with label y = 1, and negative quadruples (i_s^-, r_tr, t, i_o), y = 0, are sampled by replacing the subject entity. The classification criterion is the binary cross-entropy loss.
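The BPR objective used for the popularity-classification task reduces to a standard pairwise loss over positive and sampled negative entity scores; a minimal sketch:

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR pairwise loss: push each popular entity's predicted score above
    that of a sampled unpopular entity."""
    pos, neg = np.asarray(pos_scores, float), np.asarray(neg_scores, float)
    # -mean log sigmoid(pos - neg); small epsilon guards the log.
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-(pos - neg))) + 1e-9)))
```

A well-separated pair gives near-zero loss, while a tied pair gives ln 2, so the gradient concentrates on entities whose popularity the model cannot yet order.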
For better computational efficiency, we sample negative entities from all graph nodes for the entity classification task, and sample negative relations from the training batch (randomly choosing an edge from the batch) for the relation prediction task. The final objective of temporal tuning is the sum of the relation prediction loss and the entity classification loss, weighted by a hyperparameter λ that balances the impact of the two tasks. Through optimization, the learned entity embedding e_{v,t} (from the temporal embedding table E^T) intrinsically captures the distilled knowledge specific to a certain time.

Entity-level Sequential Modeling in SR
A typical sequential recommender model (SRec) focuses on item-level sequential modeling, using history item representations for next-item prediction. As shown in Figure 3 (c), a traditional SRec assigns item representations via randomly initialized embeddings. For ease of explanation, we employ GRU4Rec [13] as our backbone in Phase-2, which is a straightforward yet powerful choice. Taking the traditional method, given a user's interaction sequence S_{1:n}: (i_1, t_1) → ... → (i_n, t_n), GRU4Rec models it using GRU cells. At the k-th position, the GRU cell processes the randomly initialized item embedding i_k and the hidden state from the previous GRU cell z_{k-1}, and then generates the hidden state for the next cell; mathematically, z_k = GRU(i_k, z_{k-1}). Beginning from k = 1, the sequence representation is obtained by propagating through n cells.
By contrast, the dynamic entity embeddings learned in Phase-1 can be regarded as carriers of distilled knowledge. The inclusion of user behavior assists the training of temporal knowledge, imposing hard constraints that retain the more valuable information for each time period (while disregarding irrelevant information). To explore the efficacy of dynamic entity embeddings in enhancing item representation, we model them with another GRU: z'_k = GRU(e_{i_k, t_k}, z'_{k-1}), where e_{i_k, t_k} denotes the dynamic entity embedding of item i_k at time t_k. As shown in Figure 3 (d), we fuse such entity-level modeling with item-level modeling through a linear combination layer B([z_n ; z'_n]), which yields the final probability ŷ of recommending i_{n+1}. In order to better adapt the entity embeddings to SR, we do not freeze the dynamic entity embeddings, allowing them to be further fine-tuned. Finally, the ultimate objective is to minimize the loss between the predicted value ŷ and the true label y.
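A compact sketch of this dual-track design: two GRUs, one over item embeddings and one over dynamic entity embeddings, whose final states are combined before scoring candidates. The fixed-weight toy GRU cell and the simple averaging combination stand in for the learned layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyGRUCell:
    """Minimal GRU cell with fixed random weights, for illustration only."""
    def __init__(self, d, seed):
        rng = np.random.default_rng(seed)
        self.Wz, self.Wr, self.Wh = (rng.normal(scale=0.1, size=(d, 2 * d))
                                     for _ in range(3))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z, r = sigmoid(self.Wz @ xh), sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def dual_track_scores(item_seq, entity_seq, item_table, cand_table, d=4):
    """Run an item-level and an entity-level GRU in parallel, combine their
    final states (a fixed average stands in for the learned B([z; z'])),
    and produce softmax scores over candidate items."""
    gru_item, gru_ent = ToyGRUCell(d, 1), ToyGRUCell(d, 2)
    z = z_ent = np.zeros(d)
    for i, e in zip(item_seq, entity_seq):
        z = gru_item.step(item_table[i], z)       # item-level track
        z_ent = gru_ent.step(e, z_ent)            # entity-level track
    user_repr = 0.5 * (z + z_ent)                 # stand-in combination layer
    logits = cand_table @ user_repr
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                        # softmax over candidates
```

The point of the structure is that the entity track can be fed time-indexed embeddings (a different vector for the same item at different periods) without changing the item track at all, which is why the framework plugs into GRU, Transformer, or MLP backbones alike.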

Complexity Analysis
Time complexity. Our TKG-SRec is a streamlined two-phase framework optimized for lightweight TKG use. The enhanced entity embedding is directly applicable for online inference, making its time complexity equivalent to that of basic SR models. For training, the two-phase approach simplifies integration and keeps the overall complexity low.
Space complexity. In Phase-1, training dynamic entity embeddings is memory-intensive for many GPUs. We address this using memory-saving strategies. In both static pretraining and temporal tuning, we employ block-wise graph propagation, where neighbor-node messages are propagated within small batched sub-graphs (known as blocks in DGL [47]). This method cuts GPU memory usage from O(B^2) to O(20B) during static pretraining, with B being the average number of entities in a batch and 20 the default number of sampled neighbors for propagation. For temporal tuning's sub-graph sampling, nodes are selected from the temporal snapshots G^E_{t+1} and G^S_{t+1} rather than the full knowledge graph. This reduces space complexity from O(T·N^2) to O(Σ_{t=1}^{T} N_{t+1}), where N_t represents the number of entities in the t-th snapshot. The entity count in each snapshot, N_t, is generally much smaller than the overall entity count N.

EXPERIMENTS
Datasets. We experiment on four public datasets: LastFM [30], Amazon-Books [12], and MovieLens [10] (in two volumes). The datasets include both user-item interactions and side knowledge. To avoid heavy computation, following common preprocessing practice [11,17,28,39], we filter out overly unpopular items and inactive users with fewer than 5 records. The KGs are initially constructed by Zhao et al. [64], and we also eliminate entities that are too distant (more than 3 hops) from any item-linked entity. For the large LastFM, we follow the practice of Huang et al. [17], keeping only the last year's interactions.
Evaluation Protocol. Following standard SR settings [15,18], we allocate the earlier 85% of each user's interactions for training (which also drives the temporal knowledge construction in § 3.1) and the remaining 15% for testing. Unlike typical methods that sample negative items for evaluation [13,18], our model treats all non-selected items as negatives and pairs them with a single positive item in each sample. Our evaluation employs ranking-based metrics: Hit Ratio@k (HR@k), Normalized Discounted Cumulative Gain@k (NDCG@k), and Mean Reciprocal Rank@k (MRR@k), where k is the truncated length of the recommendation list [33].
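For a single test sample, with the positive item ranked against the full catalogue as described, the three metrics can be computed as follows (a straightforward sketch):

```python
import numpy as np

def rank_metrics(scores, target, k):
    """HR@k, NDCG@k, MRR@k for one test sample.

    scores: model scores over the full item catalogue (all non-selected
            items act as negatives); target: index of the one positive item.
    """
    rank = int(np.sum(scores > scores[target])) + 1  # 1-based rank of the positive
    hit = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0
    mrr = 1.0 / rank if rank <= k else 0.0
    return hit, ndcg, mrr
```

Dataset-level numbers are then the means of these per-sample values over all test interactions.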
Baselines. To verify the effectiveness of our proposed TKG-SRec, we compare it with two groups of models. (A) Sequential recommender baselines include Caser [35], using vertical and horizontal CNNs to model short-term sequences (limited to the last 15 items to avoid gradient explosion); GRU4Rec [13], SASRec [18], and BERT4Rec [34], utilizing GRU, Transformer, and BERT models to capture sequential patterns in user interaction histories; FMLP [66], employing MLPs instead of multi-head attention in the Transformer framework for improved filtering; CL4SRec [54], introducing sequence-based contrastive learning with unsupervised techniques; and DuoRec [26], further advancing contrastive embedding training with dropout masks and sample selection. (B) KG-based recommender baselines include KGAT [49], capturing high-order item relationships using attention to weigh entity neighbors; GRU4RK, extending GRU4Rec with dual GRUs for item and entity embeddings (pretrained using TransE [3]); and KSR [17], combining a GRU-based system with knowledge-augmented memory networks.

Overall Recommendation Accuracy
Table 1 shows results from various models across the four datasets. Our TKG-SRec outperforms all baselines, with relative improvements in brackets. Overall, this suggests that our work successfully distills more useful dynamic knowledge for the SR task. Among the runner-up baselines, SASRec and FMLP are notable as effective models trained solely on interactions, emphasizing the power of attention and filtering mechanisms in sequential modeling. Also, DuoRec's strong performance on the LastFM dataset underscores the benefits of unsupervised augmentation for recommendation.
Compared to the standard SR models, the models that incorporate additional information from KGs exhibit improved accuracy. This is evident from the relatively better performance of GRU4RK and KSR on the MovieLens datasets. However, the margin is not substantial, partly because the knowledge graphs are not specifically designed for recommendation tasks. At times, the KGs even introduce noise that can harm recommendation accuracy, as seen when FMLP outperforms them without KG information on the ML and Amazon datasets. KGAT, while not specifically designed for the SR task and focusing only on collaborative signals, still shows comparable performance against other SR models. This further emphasizes the importance of distilling useful knowledge for producing more accurate recommendations.
Interestingly, we observe that TKG-SRec's improvement is less pronounced on the ML-1M dataset compared to other datasets. This is attributed to the smaller ratio of temporal relations (|E_T|) to static relations (|E|): ML-1M has a ratio of 0.6M/4M, whereas datasets like Amazon-Books have a ratio of 0.4M/1M. Consequently, less information is derived from temporal properties and more from static ones.

Ablation Study
In this section, we conduct a series of experiments to better understand the rationale behind our framework's design.

4.2.1
On the superiority of KG modeling. In Table 2 (left column), we measure the merits of various KG modeling methods: we contrast our static encoder (referred to as S.K.) with other commonly utilized KG embedding learning techniques, TransE and RESCAL. It is evident that our static encoder surpasses TransE and RESCAL across all four datasets, illustrating the efficacy of convolution in graph learning and the reliability of our pretrained entity embeddings. Furthermore, we observe that relying solely on static knowledge results in a significant decrease in performance, which further confirms that knowledge distillation based solely on existing information is difficult to apply effectively in sequential tasks.

4.2.2
On the effect of TKG construction. In Table 2, we further analyze the impact of constructing each type of temporal knowledge separately: structure dynamics (T.K.S) or entity dynamics (T.K.E). Specifically, the temporal encoders propagate messages in the graph snapshots G^S or G^E and supervise the fine-tuning with the relation prediction loss or the entity classification loss. We observe that solely considering structure dynamics leads to a larger improvement over the base static encoder (↑8.2%) compared to utilizing entity dynamics (↑3.2%). The exception is the ML-1M dataset, where static information has a greater impact. As highlighted earlier, the sparse temporal knowledge in ML-1M could diminish its effect. Focusing only on a single type of temporal knowledge might further disrupt precise static modeling and obstruct effective knowledge distillation. Additionally, to verify the benefits of the evolution units, we replace them with a simple integration (h_{v,t} = h̃_{v,t} + h^S_v) of Equation 5 (marked as w/o EU). We observe that the simple integration significantly underperforms KEN, confirming that taking static knowledge as a soft constraint is a better approach to pretraining entity embeddings.

Table 3: Backbone compatibility analysis over four sequential datasets, evaluated with (HR, NDCG, MRR)@5.

4.2.3
On the compatibility of sequential modeling methods. To assess the general applicability of TKG-SRec, we employ pretrained TKG embeddings with various backbones and compare them to the original approaches. We adapted SASRec and FMLP following the approach outlined in Section 3.4, incorporating entity-level embeddings into Transformer- and MLP-based encoding layers and combining these with item-level outcomes for final predictions. Both models integrate dynamic entity and positional embeddings.
From the results in Table 3, we observe that the GRU variant benefits the most from dynamic entity embeddings, consistently delivering substantial improvements. SASRec also demonstrates improvement with fine-grained entity embeddings. However, the TKG's contribution to enhancing FMLP is relatively marginal, with its performance even slightly declining on the ML-100K dataset. We interpret this as a result of our method's underlying principle as a noise filter that removes irrelevant knowledge, similar to filter-based models that operate at the feature level. This characteristic leads to a modest enhancement of FMLP, particularly on smaller datasets like ML-100K, where the limited size of the KG curtails the impact of knowledge filtering.

Effectiveness of Popularity-Based Statistics
In this study, popularity-based statistics serve as a means of noise reduction, emphasizing behavior patterns with higher confidence. In this section, we scrutinize the utility of popularity as a statistical feature for knowledge distillation in SR. We gauge the relevance of the knowledge distilled from the KG by examining the relatedness of sampled item entity embeddings (Figure 4), where relatedness is measured with cosine similarity. Smaller relatedness values indicate that the distilled knowledge is less relevant to the recommendation task. In SR, it is commonly agreed that items closer together in a sequence should be more strongly correlated, while distant items tend to be less relevant. We therefore assess whether the relatedness mirrors the average order distance in the interaction sequence, thereby indicating its adaptability to SR.
We observe that entity embeddings derived by TransE and RGCN fail to clearly correlate item order distance with embedding relatedness, revealing a gap between KG and SR information and showing that traditional approaches fail to distill relevant knowledge from the TKG. In contrast, our KEN yields less related embeddings for more distant sequence items. With increasing KG size, the correlation between order distance and embedding similarity strengthens, despite more fluctuations on Amazon-Books. This underlines the difficulty of preserving the KG's static structural information while distilling information to optimize SR, especially for larger KGs with more complex structures.
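The relatedness measurement above can be outlined in a few lines: compute pairwise cosine similarity within an interaction sequence and average it by order distance. This is a sketch with toy hand-crafted embeddings (the real analysis uses learned KEN embeddings); the smoothly drifting vectors merely simulate the desired SR property:

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def relatedness_by_distance(seq_embs):
    """Mean pairwise cosine similarity of items in one interaction
    sequence, grouped by order distance |i - j|."""
    buckets = {}
    n = len(seq_embs)
    for i in range(n):
        for j in range(i + 1, n):
            buckets.setdefault(j - i, []).append(cosine_sim(seq_embs[i], seq_embs[j]))
    return {d: float(np.mean(v)) for d, v in buckets.items()}

# Toy embeddings that drift smoothly along the sequence, mimicking the
# desired property that nearby items stay more related.
embs = [np.array([np.cos(0.3 * i), np.sin(0.3 * i)]) for i in range(10)]
rel = relatedness_by_distance(embs)
assert rel[1] > rel[5] > rel[9]   # relatedness decays with order distance
```

An embedding table that passes this monotonicity check is better adapted to SR than one whose relatedness is uncorrelated with order distance, which is exactly the gap observed for TransE and RGCN.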

Parameter Sensitivity
We first analyze the sensitivity of KEN to the number of GCN layers in the static and temporal knowledge graph encoders. In the sensitivity test, we fix all other hyperparameters and vary only the combination of the two layer counts. As shown in Table 4, the model performs poorly when both encoders use 4 layers: a large number of graph convolution layers mixes in too much information from neighbors, which obscures the original entity properties (the well-known over-smoothing effect). TKG-SRec achieves the best performance with 1 or 2 layers on all four datasets. We also investigate how the hidden size affects TKG-SRec. As shown in Table 5, extra hidden dimensions initially increase model capacity, and performance improves as the hidden size grows. However, when it is raised further, performance suffers, as an overly large number of dimensions can cause overfitting. A hidden size of 32 to 64 yields the best results.
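The over-smoothing effect behind the poor 4-layer setting can be demonstrated directly: repeated neighbor averaging drives node representations toward a common vector, erasing entity-specific information. A small illustration on a toy graph, with mean aggregation standing in for the smoothing part of a GCN layer (no learned weights or nonlinearity):

```python
import numpy as np

def mean_aggregate(adj, feats):
    """One round of neighbor averaging (with self-loops): the smoothing
    component of a GCN layer, stripped of weights and nonlinearity."""
    a = adj + np.eye(len(adj))
    return (a / a.sum(axis=1, keepdims=True)) @ feats

def spread(feats):
    """How far node features are from the all-nodes mean."""
    return float(np.linalg.norm(feats - feats.mean(axis=0)))

# A 4-entity ring graph with initially distinct one-hot features.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
h, spreads = np.eye(4), []
for _ in range(4):
    h = mean_aggregate(adj, h)
    spreads.append(spread(h))
# Each extra layer pushes all nodes closer to one common vector.
assert all(s1 > s2 for s1, s2 in zip(spreads, spreads[1:]))
```

With every aggregation round the node features contract toward their mean, which matches the observation in Table 4 that shallow (1- or 2-layer) encoders preserve entity properties best.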

RELATED WORK
Knowledge Graph Completion is typically explored on static KGs. Common methods include translational models such as TransE [3], TransH [51], and TransR [23], which embed entities and relations and score triples with a distance-based function, and propagation-based models that use graph neural networks such as GCN [31] and RGCN [31].
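As a concrete example of a translational scoring function, TransE models a valid triple (h, r, t) as an approximate translation h + r ≈ t, so a smaller distance ||h + r − t|| indicates a more plausible triple. A minimal sketch (illustrative, not any baseline's actual implementation):

```python
import numpy as np

def transe_score(h, r, t, norm=2):
    """TransE plausibility score: the closer h + r lands to t,
    the higher (less negative) the score."""
    return -float(np.linalg.norm(h + r - t, ord=norm))

rng = np.random.default_rng(0)
dim = 8
h = rng.normal(size=dim)
r = rng.normal(size=dim)
t_good = h + r + 0.01 * rng.normal(size=dim)   # near-perfect translation
t_bad = rng.normal(size=dim)                   # unrelated tail entity

assert transe_score(h, r, t_good) > transe_score(h, r, t_bad)
```

Training then pushes observed triples toward high scores and corrupted (negative-sampled) triples toward low ones, which is the general recipe the translational family shares.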
Further considering Temporal Knowledge Graph Completion task [4], various techniques exist for integrating timestamps into TKGs.
Tensor decomposition [29] simplifies the complex 4-way tensor (head, relation, timestamp, tail). Time transformation [20] interprets timestamps as transformations of entity and/or relation representations using time-specific functions. Studies on dynamic TKGs note that entity or relation representations evolve over time; such approaches include merging static information with trend and seasonal data at specific times [56] and capturing relations among concurrent facts through KG subgraph snapshots [22]. However, according to Cai et al. [4], current TKG completion methods falter on larger, dynamic data. For our task of handling dynamic user-item interaction data, we need efficient methods that address the ongoing changes in real-world knowledge graphs, discarding irrelevant knowledge and incorporating new data.

KG-Augmented Recommendation employs two main approaches to utilize KGs: two-phase learning and joint learning. The former trains entity embeddings before integrating them into the recommender [16,17,42,59]. The latter trains entity and user/item embeddings together, with either a single objective [40,61] or multiple objectives [5,44]. In addition, personalization in recommender systems is advanced by incorporating user behaviors into KGs: RippleNet [41] uses a memory-network model for user-specific operations, KGCN [45] employs user-specific weights, and TPRec [65] further integrates time-aware user-item graphs into KGs.
SR specifically targets modeling users' sequential interests [15,37,60]. Enhancements of SR through KGs include KGIE [39] for entity-level user interest modeling, KSR [17], which integrates entity data with memory networks, and KERL [48], which applies knowledge-guided reinforcement learning. MKM-SR [25] combines KGs with item sequences and user actions, and Chorus [38] uses KGs to link items in SR. However, time-aware and dynamic KGs remain unexplored in SR. We propose to use temporal knowledge graphs to model time-aware user behaviors in the KG, filtering irrelevant knowledge out for SR.

CONCLUSION AND FUTURE WORK
In our study, we enhance KGs by tracking dynamic user behaviors to create Temporal Knowledge Graphs (TKGs). The temporal knowledge introduced in TKGs covers two distinct facets: entity-related and graph-structure-related dynamics. Our novel KEN component filters information by leveraging both static and dynamic knowledge over time, leading to the TKG-SRec framework, a novel approach to KG-enhanced SR. Our experiments show consistent effectiveness across various datasets and three backbones.
For future work, beyond treating behavior statistics as temporal knowledge, our framework could be strengthened with additional factual data such as product release dates. TKG-SRec sets a new paradigm for utilizing temporal knowledge in high-throughput recommendation systems. Additionally, adopting more sophisticated time modeling methods, such as Time2Vec [19], which captures periodicity, could further refine the system's effectiveness.
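To make the Time2Vec direction concrete: Time2Vec maps a timestamp to a vector whose first dimension is linear in time and whose remaining dimensions are sinusoidal, letting the representation capture both progression and periodicity. A sketch with fixed frequencies (in practice the frequencies and phases are learned parameters):

```python
import numpy as np

def time2vec(tau, omega, phi):
    """Time2Vec: dimension 0 is linear in time; the rest are
    sinusoidal and so capture periodic behavior patterns."""
    v = omega * tau + phi
    v[1:] = np.sin(v[1:])
    return v

omega = np.array([1.0, 2 * np.pi, np.pi])  # illustrative, normally learned
phi = np.zeros(3)
a, b = time2vec(0.25, omega, phi), time2vec(1.25, omega, phi)
assert np.isclose(a[1], b[1])       # period-1 component repeats after one unit
assert not np.isclose(a[0], b[0])   # linear component still separates the times
```

Such a representation could replace raw snapshot indices in the temporal encoder, letting recurring behavior patterns (e.g. weekly cycles) share structure across time.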

Figure 1 :
Figure 1: A TKG exploits the dynamic relations among entities: (left) the co-purchase of iPhone and Apple Watch rises in prominence over time; (right) resulting in the creation of a new dependency between the two entities in the TKG.

Figure 2 :
Figure 2: Temporal KG construction from static knowledge and temporal knowledge. Solid arrows indicate static relations; dotted arrows indicate temporal relations.

Figure 3 :
Figure 3: The framework of Temporal Knowledge Graph enhanced Sequential Recommendation. The KEN component (b) utilizes the graph encoder (a) to learn dynamic entity embeddings. The sequential recommender component (d) leverages the dynamic entity embeddings in the traditional RNN-based backbone (c).

3.3.3 Temporal Decoder & Training Targets.
We have two objectives, entity classification and relation prediction, to guide the training of the temporal knowledge evolution.
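The paper's actual decoders are not reproduced in this excerpt, so the following is only a generic sketch of the two training targets, assuming a cross-entropy form for entity classification and a sigmoid on a translational dot-product score for relation prediction (both assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entity_classification_loss(entity_emb, class_weights, label):
    """Cross-entropy over entity classes for the entity-classification target."""
    p = softmax(class_weights @ entity_emb)
    return float(-np.log(p[label]))

def relation_prediction_loss(h, r, t, exists):
    """Binary loss on whether the triple (h, r, t) holds; a sigmoid of a
    translational score stands in for the paper's decoder."""
    s = 1.0 / (1.0 + np.exp(-float((h + r) @ t)))
    return float(-np.log(s) if exists else -np.log(1.0 - s))

rng = np.random.default_rng(0)
dim, n_classes = 8, 4
l_ent = entity_classification_loss(rng.normal(size=dim),
                                   rng.normal(size=(n_classes, dim)), label=2)
l_rel = relation_prediction_loss(0.1 * rng.normal(size=dim),
                                 0.1 * rng.normal(size=dim),
                                 0.1 * rng.normal(size=dim), exists=True)
loss = l_ent + 0.5 * l_rel   # a weight balances the two objectives
assert l_ent > 0 and l_rel > 0
```

Combining the two losses supervises both the entity-dynamics and structure-dynamics facets of the temporal knowledge during fine-tuning.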
Here, the relevant quantities are the maximum sequence length, the number of users, and the average numbers of static-relation and temporal-relation edges in each snapshot. Most of the training time centers on the temporal knowledge evolution, whose cost grows with the static edge count and, through the temporal encoder layers, with the per-snapshot temporal edge count. The time-granularity parameter influences both the snapshot resolution and the training complexity; given the vastness of real-world TKGs, the two must be balanced.

Figure 4 :
Figure 4: The correlation between entities' embedding relatedness and their order distances in interaction sequences.

Table 1 :
Table 1: Overall Performance. Bold text indicates the best performance; underlined text indicates the second best. † indicates a statistically significant improvement (p-value < 0.05) of TKG-SRec over the best baseline.

Table 2 :
Table 2: Ablation study results, reported with (HR, NDCG, MRR)@5. ML-HK and Amaz. stand for ML-100K and Amazon-Books. The left column shows models using only static knowledge; the middle column shows results with added temporal knowledge.

Table 4 :
Table 4: R@10 w.r.t. the number of GCN layers in the static and temporal encoders.

Table 5 :
Table 5: R@10 w.r.t. the dimension of the hidden layers.