CPMR: Context-Aware Incremental Sequential Recommendation with Pseudo-Multi-Task Learning

The motivations of users to make interactions can be divided into static preference and dynamic interest. To accurately model user representations over time, recent studies in sequential recommendation utilize information propagation and evolution to mine from batches of arriving interactions. However, they ignore the fact that people are easily influenced by the recent actions of other users in the contextual scenario, and applying evolution across all historical interactions dilutes the importance of recent ones, thus failing to model the evolution of dynamic interest accurately. To address this issue, we propose a Context-Aware Pseudo-Multi-Task Recommender System (CPMR) to model the evolution in both historical and contextual scenarios by creating three representations for each user and item under different dynamics: static embedding, historical temporal states, and contextual temporal states. To dually improve the performance of temporal states evolution and incremental recommendation, we design a Pseudo-Multi-Task Learning (PMTL) paradigm by stacking the incremental single-target recommendations into one multi-target task for joint optimization. Within the PMTL paradigm, CPMR employs a shared-bottom network to conduct the evolution of temporal states across historical and contextual scenarios, as well as the fusion of them at the user-item level. In addition, CPMR incorporates one real tower for incremental predictions, and two pseudo towers dedicated to updating the respective temporal states based on new batches of interactions. Experimental results on four benchmark recommendation datasets show that CPMR consistently outperforms state-of-the-art baselines and achieves significant gains on three of them. The source code is available at https://github.com/DiMarzioBian/CPMR.


INTRODUCTION
For human beings, interacting is an important way to understand other objects and update one's own cognition. With time information available, interaction trajectories are typically modeled as chronologically ordered sequences in real applications, e.g., e-commerce (click, add to cart, buy, or even neglect while browsing) [48], music apps (listen for a while or switch rapidly) [3], and social media (post, reply, forward, and mention) [8]. The motivations for making an interaction vary but can be broadly categorized into two types: static preference (long-term interest) and dynamic interest (short-term interest). The need to properly model these two types of user interest from interaction sequences motivates effective sequential recommender models.
Different from traditional sequential recommender systems [4,14,17,25] that only make use of the chronological order of interactions, recent studies [10,16] have shown that interaction timestamps are more informative for characterizing temporal dynamics. In terms of time modeling, some representation learning works employ additive or concatenative temporal embeddings [10,16,40,41]. Others further model the continuous decay of an interaction's effect after its occurrence [5,49]. Moreover, to better capture sequential information from discrete occurrences of interactions in continuous time, hybrid models mixing temporal point processes and recommender systems [2,7,27,28] calculate time-based intensities on items to give recommendations. Some graph-based works [10,30] also introduce time-based graphs to model complex dynamic connectivities and adopt additive or concatenative temporal embeddings. The above studies model temporal representations in the same way as static embeddings, and thus fail to capture the complex dynamics of time-varying interests.
To fully represent dynamic state embeddings over time, IncCTR [32] introduces incremental learning into Click-Through Rate prediction. SML [44] and FIRE [38] consider time efficiency and train their models in a fast incremental manner without querying historical data. Besides these works, two evolution models for incremental sequential recommendation have been proposed: JODIE [15] and CoPE [45]. JODIE employs a recurrent neural network structure to discretely model interest trajectories on dynamic embeddings. CoPE approximates the continuous evolution between sets of concurrent interactions by using CGNN [10] to aggregate all historical interactions. Both evolution models learn static embeddings and temporal states, representing static attributes and dynamic properties as functions of time, respectively.
Despite their superiority over models that do not consider temporal states, these evolution models evolve their temporal states across the entire history graph without treating recent interactions specially. Consequently, they can hardly capture the changing dynamics, and they ignore the timeliness of temporal states, that is, the impact of recent contexts on users' dynamic interests. Considering that such contexts are sensitive to time, we propose a new evolution model, the Context-Aware Pseudo-Multi-Task Recommender System (CPMR), to fuse the evolution of both historical and contextual scenarios in an MTL-like manner. Specifically, CPMR creates three representations for each user and item under different dynamics: static embeddings, historical temporal states, and contextual temporal states. Based on them, CPMR carries out effective information fusion and efficient joint optimization through a Pseudo-Multi-Task-Learning (PMTL) paradigm.
MTL is a natural and principled solution to our recommendation problem, as our modeling involves multiple tasks for evolving, updating, and fusing different temporal dynamics, all sharing the same set of temporal states. Conventional MTL cannot be applied directly because not all of our tasks generate losses. Therefore, we devise a new pseudo-MTL specifically for our problem. As shown in Figure 1, the tasks of our proposed PMTL can be divided into the real task, which generates losses, and pseudo tasks, which generate updated inputs for the next recurrence.
CPMR is an evolution model implemented under this PMTL paradigm that recurrently evolves the temporal states. In particular, CPMR treats recommendation as the real task within the prediction module of the real tower, while treating the updating of temporal states as pseudo tasks within the update modules of the pseudo towers. To model the continuous evolution and the mutual information sharing of historical and contextual temporal states, CPMR also employs two evolution module instantiations and two fusion module instantiations as the shared-bottom networks of the PMTL paradigm. More specifically, the update module captures the instant change of temporal states caused by concurrent interactions, while the evolution module models the continuous evolution of temporal states over the intervals between batches of interactions. To distinguish evolution within the historical and contextual scenarios, CPMR creates two instances of both modules, dedicated to the history graph and the context graph, respectively. The history graph consists of all the interactions that have occurred, and the context graph consists of the interactions that occurred within a fixed-length sliding time window (the context window) from the current moment. After evolution, these temporal states are mutually updated at the user and item levels in the two fusion module instantiations.
Empowered by the PMTL paradigm and context awareness, CPMR is able to dynamically model the historical and contextual temporal states of users and items and make recommendations. Experimental results demonstrate that CPMR outperforms state-of-the-art baselines, achieving 30.98% gains on MRR and 27.39% gains on Recall@10. Our contributions are summarized as follows:
• We propose a Pseudo-Multi-Task-Learning module to stack single-target incremental recommendations into one multi-target task by mutually evolving temporal states between tasks.
• We devise CPMR based on the PMTL paradigm, which enables the evolution and fusion of user interests and item attributes as temporal states within both historical and contextual scenarios.
• We conduct extensive experiments on four datasets to evaluate the effectiveness of CPMR and perform ablation studies on the fusion modules and the proposed contextual temporal states. The results show that CPMR consistently outperforms state-of-the-art baselines, and both proposed components play important roles.

RELATED WORK
Sequential Recommendation (SR): SR is the task of predicting the next behavior by leveraging a sequence of chronologically ordered historical behaviors. The earliest work, FPMC [20], utilized Markov Chains to model transition patterns within sequences.
To harness the representation learning capability of deep learning, the CNN-based Caser [25] viewed sequential embeddings as images to extract information. Recent sequential recommender works can be divided into two categories: recurrent-based methods [9,34,39,46,52] and attention-based methods [14,22,42,43,47,48]. Some sequential recommender systems explore temporal information to enhance representation learning. For example, [40] conducted extensive experiments on different types of temporal embeddings. TiSASRec [16] embedded relative time intervals to associate interactions with time. CTRec [2] and DeepCoevolve [7] employed temporal point processes to introduce temporal dynamics into recommendations. JODIE [15] represented the temporal states of users and items by embedding trajectories. However, the update mechanisms for temporal states in these models rely largely on the entire history while ignoring the influence of context. To address this limitation, we calibrate the evolution by exploiting both historical and contextual information in a more timely manner.
Graph-based Recommendation: As each sequence can be viewed as a subgraph, a recommendation dataset can be transformed into a user-item bipartite graph or an item-item graph. For instance, SR-GNN [36] first introduced GNN techniques into recommendation tasks. LightGCN [13] designed a simple but effective graph convolution network (GCN). Subsequently, a number of studies applied graph learning to a variety of recommendation tasks [17,21,36,50], showing the potential of this combination.
Besides, temporal information can also be applied to graph-based recommendations. TGSRec [10] unified sequential patterns and temporal collaborative signals to improve recommendation. CoPE [45] proposed a CGNN-based method to learn from continuous propagation and evolution. FIRE [38] designed graph filters from a graph signal processing perspective to capture temporal dynamics and address the cold-start problem. RETE [31] proposed a retrieval-enhanced recommendation model based on knowledge graphs to model temporal dynamics. However, the number of new interactions at each timestamp is small compared to all historical interactions. Hence, applying a GNN directly to the graph of all historical interactions cannot effectively capture the dynamics of users and items. In contrast, in this work we propose to mitigate this problem by applying another GNN to a more dynamic context graph.
Multi-Task Recommendation: Multi-Task Learning (MTL) is an active research topic in recommender systems, drawing attention from both industry and academia. The general network architecture of MTL [1,18,19,23,24] consists of a shared-bottom network that learns task-shared knowledge and multiple task-specific towers that generate the results required by the respective tasks. According to the setting of the targets, MTL models can be divided into two categories: one uses auxiliary tasks to assist in optimizing single or multiple target tasks [29,35], and the other optimizes all tasks at the same time [11,51]. By definition and task settings, common single-task problems do not have to be transformed into multi-task problems. However, if we treat the evolving temporal states as inputs to a shared-bottom network, with one tower making recommendations and the other towers updating the input temporal states for further recommendations, we can jointly optimize these single-task recommendations in an MTL-like way by sharing the evolving temporal states among them.

Problem Formulation
We provide a formal definition of the incremental sequential recommendation (ISR) task we are tackling. Assume we have a set of users U and a set of items I. We represent each user-item interaction with a timestamp as a triplet (u, i, t). At a given timestamp t_k, a user u ∈ U has a chronologically ordered historical interaction sequence S_u. If u makes an interaction at t_k, ISR aims to predict the ground-truth item i that u will interact with at t_k by mining u's interest from S_u together with all observed interactions from other users before t_k. That is, in terms of time, this task strictly adheres to the principle of no data leakage. Unlike some other incremental recommendation tasks [32,38], the ISR task allows all historical data to be used, rather than only the model's current states and the incoming interactions. The problem formulation of incremental sequential recommendation is given as: Input: the historical interactions of all users before timestamp t_k. Output: a recommender system that estimates the probability of user u interacting with every candidate item i ∈ I at t_k, and recommends a top-N recommendation list with the highest probabilities to user u.
Based on the definition of ISR, JODIE, CoPE and our CPMR can all be classified as ISR models.

Graph Formulation
By joining the interactions of all sequences into one edge set E = E_1 ∪ E_2 ∪ ... ∪ E_|U|, we can represent this edge set as a bipartite interaction graph G = (U ∪ I, E), in which for every u ∈ U, all its neighbor nodes N(u) are of item type, i.e., N(u) ⊆ I, and vice versa. Without loss of generality, we normalize the timestamps in the dataset into [0, 1] and define the moments right before and right after the timestamp t_k as t_k^- and t_k^+. As shown in Figure 2, we further define three types of graphs based on their time spans in CPMR.

Definition (History Graph). Given an interaction graph G = (U ∪ I, E), the history graph at timestamp t is formed by all the interactions that occurred before t, together with the users and items involved.

Definition (Context Graph). Given an interaction graph G = (U ∪ I, E), the context graph at timestamp t is formed by the interactions that occurred within the fixed-length sliding time window (context window) ending at t, together with the users and items involved.

Definition (Instant Graph). Given an interaction graph G = (U ∪ I, E), the instant graph at timestamp t is formed by the concurrent interactions made exactly at t, together with the users and items involved.
Given an interaction graph G = (U ∪ I, E), its adjacency matrix is denoted by

A = [ 0, R; R^T, 0 ],

where R ∈ {0, 1}^{|U|×|I|} is the bi-adjacency matrix containing all user-item interactions in the interaction graph. Because of the sparsity of recommendation datasets, node degrees can differ widely. We therefore normalize the adjacency matrix as

Ā = ρ_0 D^{-1/2} A D^{-1/2},

where D denotes the degree matrix of A, and ρ_0 is set to 0.98 to make sure all eigenvalues of Ā are in the interval [0, 1) for future modeling and approximations [6].
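As a concrete sketch of this normalization (assuming the common symmetric form D^{-1/2} A D^{-1/2}, which matches the eigenvalue claim; the function name and toy shapes are ours, not the paper's code):

```python
import numpy as np

def normalized_biadjacency(R, rho0=0.98, eps=1e-12):
    """Symmetrically normalize the bipartite adjacency built from a |U| x |I|
    bi-adjacency block R, then scale by rho0 to keep eigenvalues below 1."""
    n_users, n_items = R.shape
    # Assemble the full bipartite adjacency A = [[0, R], [R^T, 0]].
    A = np.zeros((n_users + n_items, n_users + n_items))
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, eps))
    d_inv_sqrt[deg == 0] = 0.0          # isolated nodes contribute nothing
    return rho0 * (d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])

R = np.array([[1., 0., 1.],
              [0., 1., 1.]])            # 2 users, 3 items (toy)
A_hat = normalized_biadjacency(R)
# Spectral radius of the normalized adjacency is at most 1, so after
# scaling by rho0 = 0.98 it stays strictly below 1.
print(np.max(np.abs(np.linalg.eigvalsh(A_hat))) < 1.0)  # True
```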

Pseudo-Multi-Task Learning Paradigm
Previous evolution models, JODIE [15] and CoPE [45], recurrently evolve temporal states to make recommendations as they only model a single type of temporal dynamics, i.e., historical dynamics.
In this work, we propose to model contextual dynamics in addition to the historical one. By doing so, three different tasks naturally arise: the instant update of contextual/historical temporal dynamics, the fusion and continuous evolution of these two types of temporal dynamics, and the recommendation task itself. All these tasks share the same set of user/item temporal states, while each individual task has its own characteristics and target. This motivates us to employ Multi-Task Learning (MTL) as a principled solution. However, each task in conventional MTL generates its own loss, which is not the case in our problem. Therefore, we design a new pseudo-MTL (PMTL) paradigm to better fit our needs in ISR. Specifically, we designate a task that generates losses as a real task, and those not bound to losses as pseudo tasks. Figure 1 illustrates our proposed PMTL paradigm over the task of ISR. The shared-bottom network handles the fusion and evolution of temporal states. The outputs of the shared-bottom network are fed into the pseudo tasks for instant updates and into the real task for prediction. In this way, we achieve a joint optimization akin to MTL.
In this section, we devise CPMR based on the PMTL paradigm, a system modeling both contextual and historical temporal states with four modules: 1) Evolution Module: to model the continuous evolution of temporal states over the intervals between batches of interactions; 2) Fusion Module: to fuse the historical and contextual temporal states at the user and item levels; 3) Update Module: to perform instant updates of temporal states upon new interactions; 4) Predict Module: to make incremental recommendations. Following our PMTL paradigm, CPMR instantiates two evolution modules and two fusion modules as the shared-bottom network (blue blocks in Figure 3.A). With the temporal states generated by the shared-bottom network, CPMR further implements two instantiations of the update module as two pseudo tasks (green blocks) to perform instant updates on temporal states. Finally, CPMR instantiates the real task of PMTL with a predict module (pink block), which generates the recommendation losses. While the pseudo tasks do not directly generate losses, their updated temporal states change the inputs of the next recurrence, and thus the next recurrence's loss. By jointly optimizing the losses from different recurrences, our model can not only be trained efficiently with fewer backpropagation passes but also avoid overfitting to specific recurrences.

Overview of CPMR
To be specific, CPMR learns three vector representations for each user and item: static embeddings, historical temporal states, and contextual temporal states, which capture the basic preferences (attributes) and the time-varying dynamics in the historical and contextual scenarios, respectively. On top of these representations we implement the aforementioned modules. Evolution Module: through two parallel GNN encoders, the historical and contextual temporal states of users and items evolve from t_{k-1}^+ to t_k^-. This module contains two instantiations for the historical and contextual scenarios respectively, each incorporating the corresponding input graph. The module is designed to approximate the continuous evolution of states over the interval (t_{k-1}^+, t_k^-). As shown in Figure 4, CPMR runs the evolution module, the fusion module, and the update module in a loop to keep the temporal states updated. According to the model structure illustrated in Figure 3, for k = 1, 2, ..., K, we can write CPMR in a recurrent form, ignoring the notations of users and items:

Z(t_k^-) = Evolution(Z(t_{k-1}^+)),
X(t_k), Z(t_k) = Fusion(Z(t_k^-)),
Z(t_k^+) = Update(Z(t_k), E_{t_k}),

where E ∈ R^{|U∪I|×d} denotes the static embeddings for all users and items, Z ∈ R^{|U∪I|×d} denotes the users' and items' temporal states, X ∈ R^{|U∪I|×d} denotes the users' and items' final representations for making recommendations, and the Fusion function here denotes the two fusion module instantiations for users and items for simplicity. In our implementation, each expert and gating network is a linear layer.
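To make the recurrence concrete, here is a minimal toy loop in the spirit of Figure 4; `evolve`, `fuse`, and `update` are hypothetical stand-ins that only preserve shapes, not the paper's actual modules:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 6, 4                      # |U ∪ I| nodes, latent dimension (toy)

# Hypothetical stand-ins for the learned modules.
def evolve(z, delta_t):                # continuous evolution over an interval
    return z * np.exp(-delta_t)
def fuse(z_h, z_c):                    # fusion: final reps + updated states
    return 0.5 * (z_h + z_c), z_h, z_c
def update(z, batch_mask):             # instant update from a batch
    return np.where(batch_mask[:, None], z + 0.1, z)

E = rng.normal(size=(n_nodes, d))      # static embeddings (fixed)
z_h = np.zeros((n_nodes, d))           # historical temporal states
z_c = np.zeros((n_nodes, d))           # contextual temporal states

batches = [(0.1, np.array([1, 0, 0, 1, 0, 0], bool)),
           (0.3, np.array([0, 1, 0, 0, 1, 0], bool))]
for delta_t, batch_mask in batches:
    z_h, z_c = evolve(z_h, delta_t), evolve(z_c, delta_t)   # shared bottom
    x, z_h, z_c = fuse(z_h, z_c)                            # shared bottom
    scores = (x + E) @ (x + E).T       # real task: toy predictions
    z_h = update(z_h, batch_mask)      # pseudo task: historical states
    z_c = update(z_c, batch_mask)      # pseudo task: contextual states

print(scores.shape)  # (6, 6)
```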
One example of the execution flow at the recurrence of t_k is shown in Figure 4 and summarized as follows. (1) Right after some interactions are made at t_{k-1}, CPMR runs the evolution module to evolve the temporal states from Z(t_{k-1}^+) to Z(t_k^-) over the interval between adjacent batches of interactions, (t_{k-1}^+, t_k^-). (2) At t_k, the fusion module fuses the historical and contextual temporal states and generates the final representations. (3) The predict module makes incremental recommendations, while the two update modules perform instant updates on the temporal states from the new interactions, producing Z(t_k^+) for the next recurrence. We will introduce each module in detail in the subsequent subsections.

Evolution Module
In the evolution module, the goal is to model new temporal states Z^h(t) and Z^c(t) for t ∈ (t_{k-1}^+, t_k^-). Essentially, it models the interest propagation within an interval between adjacent batches of interactions (i.e., an interval without any interactions). Instead of aggregating over all past interactions while ignoring time gaps, as CoPE does, we propose mining from trendy items and their corresponding users by establishing a context environment within the context window and removing the time gaps inside of it. Like a first-in-first-out queue on the time axis, this context graph easily adapts to new trends as time goes by. Specifically, we build our evolution module with two parallel CGNNs [37], one performing historical dynamics evolution and the other contextual dynamics evolution. The history graph, formed by all past interactions, captures the relatively static user preferences and item attributes. The context graph, formed by the recent interactions inside one context window, captures dynamic time-varying trends and accommodates potential drifts in users' interests and items' status. In this way, the evolution module in CPMR captures evolution dynamics under both the historical and contextual scenarios.
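The context window behaves like a first-in-first-out queue; a minimal sketch of selecting the context graph's edges (function name and tuple layout are our own):

```python
def context_graph_edges(interactions, t_now, window):
    """Edges of the context graph: interactions that fall inside the
    fixed-length sliding time window (t_now - window, t_now]."""
    return [(u, i, t) for (u, i, t) in interactions
            if t_now - window < t <= t_now]

hist = [('u1', 'i1', 1.0), ('u2', 'i2', 4.0), ('u1', 'i3', 9.0)]
# Older interactions fall out of the window as t_now advances.
print(context_graph_edges(hist, t_now=10.0, window=3.0))
```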
To differentiate the message passing capability of different nodes in an interaction graph, we define learnable spectral radii as a vector γ ∈ R^{|U∪I|×1} over all nodes. Intuitively, each γ_v reflects the importance of node v when its embedding is used to form the embeddings of its neighbors. With γ, we define the learnable adjacency matrix as Ā_γ := Broadcast(Sigmoid(γ)) ⊙ Ā, where the Broadcast function expands the vector to R^{|U∪I|×|U∪I|} by copying, ⊙ denotes element-wise multiplication, and the sigmoid function scales all eigenvalues of Ā_γ to be in the interval [0, 1), since max(Sigmoid(γ)) < 1. In subsequent discussions, we refer to the learnable adjacency matrix whenever the adjacency matrix of a history/context/instant graph is used.
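A small sketch of the broadcast-and-multiply step (toy values; `gamma` stands in for the learnable spectral-radius vector, and the row-wise broadcast is our reading of the Broadcast function):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learnable_adjacency(A_norm, gamma):
    """Scale the normalized adjacency with per-node, sigmoid-squashed
    spectral-radius parameters gamma (shape (n, 1))."""
    # (n, 1) * (n, n) broadcasts each node's factor across its row.
    return sigmoid(gamma) * A_norm

n = 4
A_norm = np.full((n, n), 0.25)
gamma = np.zeros((n, 1))               # sigmoid(0) = 0.5 for every node
A_learn = learnable_adjacency(A_norm, gamma)
print(A_learn[0, 0])  # 0.125
```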
Considering that the same user (item) has different interests (attributes) in different scenarios [33], we define the CGNNs' ODEs for the history graph and context graph respectively as:

dZ^h(t)/dt = (Ā^h_γ − I) Z^h(t) + E,
dZ^c(t)/dt = (Ā^c_γ − I) Z^c(t) + E,

where Ā^h_γ and Ā^c_γ use different sets of learnable spectral radii, i.e., γ^h ≠ γ^c, to account for the differences between the historical and contextual scenarios. With Z^h(0) = Z^h(t_{k-1}^+) and Z^c(0) = Z^c(t_{k-1}^+), we have the analytical solutions of the two ODEs for t > t_{k-1} and Δt = t − t_{k-1} as:

Z^h(Δt) = e^{(Ā^h_γ − I)Δt} Z^h(0) + (Ā^h_γ − I)^{-1} (e^{(Ā^h_γ − I)Δt} − I) E,
Z^c(Δt) = e^{(Ā^c_γ − I)Δt} Z^c(0) + (Ā^c_γ − I)^{-1} (e^{(Ā^c_γ − I)Δt} − I) E,

where we fix Ā_{c,t} during the entire interval for computation, i.e., Ā_{c,t} = Ā_{c,t_k^-} for t ∈ (t_{k-1}, t_k). This is reasonable because the intervals between adjacent batches of interactions are short compared to the length of the context window. With Eq. (5), we can directly apply approximations of the matrix inverse [37] and the matrix exponential [45] to obtain a discrete solution.
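As an illustration of how such a solution can be discretized, the sketch below propagates states under a simplified homogeneous ODE dZ/dt = (Ā − I)Z (the static-embedding source term is dropped), approximating the matrix exponential by a truncated Taylor series; all names are ours:

```python
import numpy as np

def expm_taylor(M, order=10):
    """Truncated Taylor approximation of the matrix exponential exp(M)."""
    n = M.shape[0]
    term = np.eye(n)
    out = np.eye(n)
    for k in range(1, order + 1):
        term = term @ M / k
        out = out + term
    return out

def evolve_states(z0, A_hat, delta_t):
    """Propagate temporal states over an interval of length delta_t under
    the assumed linear ODE dz/dt = (A_hat - I) z."""
    n = A_hat.shape[0]
    return expm_taylor((A_hat - np.eye(n)) * delta_t) @ z0

n, d = 4, 3
A_hat = np.full((n, n), 0.98 / n)      # toy normalized adjacency, eigvals < 1
z0 = np.ones((n, d))
z1 = evolve_states(z0, A_hat, delta_t=0.5)
# A higher-order truncation gives (numerically) the same answer here.
print(np.allclose(z1, expm_taylor((A_hat - np.eye(n)) * 0.5, order=30) @ z0))
```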

Fusion Module
As shown in Figure 3.B, we design the fusion module on top of the Customized Gate Control (CGC) model, the one-layer version of the well-known Progressive Layered Extraction (PLE) model [24]. In CPMR, we create two fusion module instantiations, for users and items separately. Taking the instantiation for users as an example, at t = t_k, the inputs are the users' historical temporal states Z_u^h(t_k^-) and contextual temporal states Z_u^c(t_k^-), and the outputs are their updated historical temporal states Z_u^h(t_k), updated contextual temporal states Z_u^c(t_k), and final representations X_u(t_k) for recommendation.
To be more specific, we implement one shared expert network, three task-specific expert networks, and three gating networks, as shown in Figure 3.B. Each gating network takes the outputs of its corresponding task-specific expert network and the shared expert network to generate a task-specific output. By this design, the shared expert network is affected by all tasks during optimization, but each task-specific expert network is only affected by its own task. To selectively combine information from the shared and task-specific experts, each gating network learns a weight for each expert and sums the experts' outputs with these weights. In the instantiation for users, the generation of the users' historical temporal states can be summarized as follows:

C_u = Z_u^h(t_k^-) ⊖ Z_u^c(t_k^-),
w_u^h = Gate_u^h(C_u),
Z_u^h(t_k) = w_u^h[1] · Expert_u^s(C_u) + w_u^h[2] · Expert_u^h(C_u),

where ⊖ denotes tensor concatenation on the latent dimension, C_u denotes the concatenated tensor input, w_u^h denotes the weights of the shared and task-specific expert networks learned from the gating network Gate_u^h, Expert_u^s denotes the shared expert network, and Expert_u^h and Gate_u^h denote the expert and gating networks for the task of generating the updated users' historical temporal states. All output terms, including the final representations and the historical and contextual temporal states for both users and items, can be generated by the two fusion module instantiations in a similar way.
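A toy sketch of one such gated combination (linear experts, a softmax gate over a shared and a task-specific expert; all weight names and shapes are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cgc_task_output(x, W_shared, W_task, W_gate):
    """One CGC branch: a shared expert, a task-specific expert, and a gate
    that mixes their outputs with learned weights (toy linear experts)."""
    e_shared = x @ W_shared            # shared expert output
    e_task = x @ W_task                # task-specific expert output
    w = softmax(x @ W_gate)            # one weight per expert, rows sum to 1
    return w[:, :1] * e_shared + w[:, 1:] * e_task

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(5, 2 * d))        # concat of historical + contextual
W_shared = rng.normal(size=(2 * d, d))
W_task = rng.normal(size=(2 * d, d))
W_gate = rng.normal(size=(2 * d, 2))
out = cgc_task_output(x, W_shared, W_task, W_gate)
print(out.shape)  # (5, 4)
```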

Update Module
As evidenced in previous SR models [14,34], a new interaction has an instant impact on its corresponding user and item within a short period of time. Compared to continuous evolution, this process can be seen as discrete. The update module in CPMR is designed to implement this sudden change in users' and items' states. At each timestamp t = t_k when some interactions are made, the update module takes the set of new concurrent interactions E_{t_k} as input and transforms the temporal states from Z(t_k) to Z(t_k^+) in a discrete way to reflect the instant impact. As shown in Figure 3.C, given the bi-adjacency matrix R_{t_k} of the instant graph G_{t_k}, the contextual temporal states update procedure for users is formulated as:

Z_u^c(t_k^+) = M_{t_k,u} ⊙ σ(Z_u^c(t_k) W_2 + D_{t_k}^{-1} R_{t_k} Z_i^c(t_k) V_2) + (1 − M_{t_k,u}) ⊙ Z_u^c(t_k),

where W_2 and V_2 are learnable weight matrices, D_{t_k} is the diagonal degree matrix on the rows of R_{t_k}, σ is a non-linear activation, and M_{t_k,u} ∈ {0, 1}^{|U|×d} is a masking matrix with entries of 1 for the users involved in G_{t_k}. The other temporal states of users and items can be updated in a similar way.
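A minimal sketch of such a masked discrete update; the mean aggregation, tanh activation, and weight names are our assumptions, and only users appearing in the batch change state:

```python
import numpy as np

def instant_update(z_users, R_batch, z_items, W1, W2):
    """Discrete update of user temporal states from a batch of concurrent
    interactions; users absent from the batch keep their states."""
    deg = R_batch.sum(axis=1, keepdims=True)   # interactions per user
    mask = (deg > 0).astype(float)             # masking matrix
    agg = (R_batch @ z_items) / np.maximum(deg, 1.0)   # mean of item states
    updated = np.tanh(z_users @ W1 + agg @ W2)
    return mask * updated + (1.0 - mask) * z_users

rng = np.random.default_rng(0)
n_u, n_i, d = 3, 4, 2
R_batch = np.array([[1., 0., 0., 0.],   # user 0 interacted with item 0
                    [0., 0., 0., 0.],   # user 1: no interaction this batch
                    [0., 1., 1., 0.]])  # user 2 interacted with items 1, 2
z_users = rng.normal(size=(n_u, d))
z_items = rng.normal(size=(n_i, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
z_new = instant_update(z_users, R_batch, z_items, W1, W2)
print(np.allclose(z_new[1], z_users[1]))  # True: untouched user unchanged
```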

Predict Module and Optimization
To provide a top-N recommendation list for each user who makes interactions at t = t_k, we calculate an inner-product similarity s between each such user and all items. Given a user u (an item i) at t = t_k, we fuse its static embedding E_u (E_i) and its final representation X_u(t_k) (X_i(t_k)) with one linear layer, and compute the user-item similarity by inner product:

s_{u,i} = FC_u(E_u ⊖ X_u(t_k)) · FC_i(E_i ⊖ X_i(t_k)),

where ⊖ denotes concatenation on the latent dimension, and FC_u and FC_i are two fully-connected layers.
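A sketch of this scoring step (one toy linear layer per side; weight names and shapes are illustrative):

```python
import numpy as np

def score_items(e_u, x_u, E_items, X_items, Wu, Wi):
    """Fuse static embedding and final representation with one linear layer
    on each side, then score every item by inner product."""
    h_u = np.concatenate([e_u, x_u]) @ Wu                   # fused user vector
    H_i = np.concatenate([E_items, X_items], axis=1) @ Wi   # fused item vectors
    return H_i @ h_u                                        # similarity to all

rng = np.random.default_rng(0)
n_items, d = 5, 3
e_u, x_u = rng.normal(size=d), rng.normal(size=d)
E_items = rng.normal(size=(n_items, d))
X_items = rng.normal(size=(n_items, d))
Wu, Wi = rng.normal(size=(2 * d, d)), rng.normal(size=(2 * d, d))
scores = score_items(e_u, x_u, E_items, X_items, Wu, Wi)
topk = np.argsort(-scores)[:3]         # top-3 recommendation list
print(scores.shape, topk.shape)  # (5,) (3,)
```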
Optimization: During model optimization, TBPTT (truncated backpropagation through time) is performed every 20 batches (i.e., 20 unique timestamps). For each interaction (u, i, t_k) in E_{t_k}, we randomly sample n_u negative users that have never interacted with item i before t_k, and n_i negative items that user u has never interacted with before t_k. We then create a negative edge set E_{t_k}^- by connecting the sampled negative users to i and the sampled negative items to u. We compute the prediction loss for each interaction (u, i, t_k) ∈ E_{t_k} via InfoNCE [26]:

ℓ_{u,i} = −log( exp(s_{u,i}) / (exp(s_{u,i}) + Σ_{(u',i') ∈ E_{t_k}^-} exp(s_{u',i'})) ).

The incremental recommendation loss at t = t_k is computed as the average loss over all concurrent interactions in G_{t_k}:

L_{t_k} = (1 / |E_{t_k}|) Σ_{(u,i,t_k) ∈ E_{t_k}} ℓ_{u,i}.
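A scalar sketch of the InfoNCE term for a single interaction, contrasting one positive score against sampled negative scores:

```python
import numpy as np

def info_nce_loss(pos_score, neg_scores):
    """InfoNCE for one interaction: negative log-probability of the positive
    pair under a softmax over the positive and sampled negative scores."""
    logits = np.concatenate([[pos_score], neg_scores])
    logits = logits - logits.max()     # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# A confident model (positive far above the negatives) yields a small loss.
low = info_nce_loss(5.0, np.array([-1.0, 0.0, -2.0]))
high = info_nce_loss(0.0, np.array([5.0, 4.0, 3.0]))
print(low < high)  # True
```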

EXPERIMENTS
In this section, we conduct experiments to evaluate the effectiveness of CPMR. We aim to answer the following research questions:
• RQ1: How good is the performance of CPMR compared with state-of-the-art evolution models?
• RQ2: How does the length of the context window affect the performance of CPMR?
• RQ3: How do the proposed contextual temporal states affect the performance of CPMR compared to historical temporal states?
• RQ4: How does the proposed PMTL paradigm affect joint optimization and the mutual update of temporal states?

Experimental Setup
Datasets: We conduct experiments on four public sequential recommendation datasets: three subsets of the Amazon review datasets [12] ('Patio, Lawn and Garden' as Garden, 'Amazon Instant Video' as Video, and 'Video Games' as Games) and one movie rating dataset (ML-100K). We do not use session-based recommendation datasets because long-term sequences of user interactions show clearer interest evolution than short anonymous sessions, and session-based recommendation models usually do not explicitly embed users. We use the same preprocessing pipeline as Caser [25] and CoPE [45], which recursively discards users and items with fewer than 5 observations until each remaining user or item has at least 5 interactions. Following CoPE [45] and JODIE [15], each dataset is split by time into 80%/10%/10% training, validation, and test sets. The statistics of these datasets are summarized in Table 1. During experiments, we coarsen timestamps to day granularity as a trade-off between the granularity and efficiency of incremental recommendation. We run CPMR and CoPE (ours) five times with different seeds on each dataset to obtain the experimental results.
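The recursive 5-core preprocessing described above can be implemented as a fixed-point filter; a minimal version (the interaction tuple layout is our assumption):

```python
from collections import Counter

def k_core_filter(interactions, k=5):
    """Recursively drop users and items with fewer than k interactions until
    every remaining user and item has at least k."""
    data = list(interactions)
    while True:
        u_cnt = Counter(u for u, i, t in data)
        i_cnt = Counter(i for u, i, t in data)
        kept = [(u, i, t) for u, i, t in data
                if u_cnt[u] >= k and i_cnt[i] >= k]
        if len(kept) == len(data):     # fixed point reached
            return kept
        data = kept

# Toy check with k=2: user 'c' has one interaction, so it is pruned
# together with its item 'z'; the rest survives.
toy = [('a', 'x', 1), ('a', 'y', 2), ('b', 'x', 3), ('b', 'y', 4), ('c', 'z', 5)]
print(k_core_filter(toy, k=2))
```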
Baselines: Regarding the baseline selection and quality metrics used, we follow those in CoPE [45].Specifically, we compare CPMR with various baselines, including (1) graph-based recommendation model, LightGCN [13]; (2) deep recurrent recommendation models, such as Time-LSTM [52] and RRN [34]; (3) temporal network embedding models, such as DeepCoevolve [7], JODIE [15] and CoPE [45].For JODIE and CoPE, we use their variants reported in the paper of CoPE [45]: JODIE* that disables test-time training and CoPE* that disables test-time training, meta-learning, and jump loss.We also report our own rewritten and re-tuned CoPE (ours) with meta-learning and jump loss implemented.For quality metrics, we use the MRR (mean reciprocal rank) for the targets among all items, and Recall@10 scores for target items among top-10 recommendations.For a fair comparison, we use the reported results from CoPE [45] for all baselines in the first section in Table 3.
Hyperparameters: Our selections of hyperparameters are reported in Table 2.We use  = 128 as the dimensions of static embeddings and two temporal states for CPMR and CoPE (ours).Considering the fact that incremental recommendation is very prone to overfitting, we carefully tune these models on learning rate and l-2 regularization weight (weight decay) using small step sizes.To figure out how context affects model performance, we also tune the window length from 5 days to 100 days at a step of 5 days.

Recommendation Performance (RQ1)
As shown in Table 3, CPMR outperforms all baselines on both MRR and Recall@10 across all four sequential recommendation datasets for the ISR task. Compared with the best-performing baseline models, CPMR achieves average gains of 30.98% on MRR and 27.39% on Recall@10. Paired t-tests show that CPMR outperforms CoPE significantly on the Video, Games, and ML-100K datasets, where the contextual trends are much more obvious. For Garden, we explain why the gain is subtle in subsequent sections. The superior performance of CPMR over CoPE is credited to the design of the PMTL paradigm and context awareness. The other temporal baselines perform worse than CoPE and CPMR because of their recurrent treatment of concurrent interactions. Without time awareness, LightGCN lacks the capability to capture embedding dynamics, resulting in its inferior performance.

Length of Context Window (RQ2)
The length of the context window, w, is an important hyperparameter affecting the context awareness of CPMR. Figure 5 reports the results of the sensitivity study on the four datasets. We use context window lengths w in multiples of 5, ranging from 5 to 100 days. The optimal lengths are discussed below.
Garden: The optimal w is 5 days, while the total time span is 5221 days. Such a short context length suggests that hardly any contextual trends exist. Considering that garden tools iterate very slowly, a short w matches reality. The lack of context also results in the smallest improvement ratio on Garden across all datasets.

ML-100K & Video: The optimal w is 35 days. Since both datasets contain movies and TV series, this length is roughly in line with their showtime and popularity cycle on social networks.
Games: The optimal w is 50 days. Gaming communities typically engage in prolonged discussions about popular games, and meticulously crafted games often demand tens to hundreds of hours to complete. In this case, it makes sense to use a longer context window.
Table 4: Ablation studies of the fusion module and contextual/historical temporal states.

Ablation: Context and History (RQ3)
To validate the effectiveness of contextual awareness, we choose the two datasets with the largest and smallest CPMR improvements, Garden and Video, and design two variants for the ablation studies as follows: • w/o ctx: disable contextual temporal states by removing the contextual instantiation of the evolution module and changing the input of the fusion module into historical temporal states only (including removing tensor concatenation, Gate  network and Expert  network).• w/o his: disable historical temporal states by removing the historical instantiation of the evolution module and changing the input of the fusion module into contextual temporal states only (including removing tensor concatenation, Gate ℎ network and Expert ℎ network).
From the second section of Table 4, removing either of the two temporal states leads to a decrease in model performance. This is because a single temporal state carries less information on the dynamics and cannot support information fusion in the fusion module. Furthermore, on both datasets, CPMR without history performs better than the version without context, which shows the importance of context awareness when modeling evolution dynamics.
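As a rough illustration of the kind of gated expert fusion ablated here, the sketch below blends historical and contextual states with a softmax gate over two experts. All weight shapes, activations, and names are our own assumptions for illustration, not the paper's exact fusion architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_fusion(z_hist, z_ctx, W_gate, W_exp_h, W_exp_c):
    """Blend historical and contextual temporal states with a learned gate.
    z_hist, z_ctx: (n, d) state matrices; weight shapes are illustrative."""
    concat = np.concatenate([z_hist, z_ctx], axis=-1)       # (n, 2d)
    gate = softmax(concat @ W_gate)                         # (n, 2) mixture weights
    expert_h = np.tanh(z_hist @ W_exp_h)                    # (n, d) historical expert
    expert_c = np.tanh(z_ctx @ W_exp_c)                     # (n, d) contextual expert
    return gate[:, :1] * expert_h + gate[:, 1:] * expert_c  # (n, d) fused state
```

The w/o ctx and w/o his variants correspond to dropping one expert branch (and the concatenation), so the gate degenerates and no cross-source blending occurs.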

Ablation: Fusion Module (RQ4)
To validate the effectiveness of the fusion module, we design a variant for this ablation study as follows:
• w/o fusion: remove the fusion module and directly sum the involved temporal states, i.e., use the sum of the historical and contextual temporal states as the input of the update module and the predict module.
We can observe from Table 4 that, without the fusion module, the performance of CPMR drops dramatically on Video but only slightly on Garden. Based on the experiment in Section 5.3, we believe this is because the context of the Garden dataset contains too little unique information, which makes the information fusion less effective.
To better inspect the performance of the fusion module, we also tune the number of batches per TBPTT, which can be seen as the number of tasks in MTL. The results in Figure 6 show that choosing an appropriate number is important: too few batches may lead to inefficiency and task-specific overfitting, while too many batches may lead to a lack of guidance on temporal state evolution. We observe that 20 batches per TBPTT is a good choice for both the Garden and Video datasets.
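The loss-accumulation schedule described above can be sketched as a plain Python loop. The callbacks (`predict_and_loss`, `update_state`, `apply_gradients`) are placeholders for the corresponding CPMR modules and optimizer step; only the accumulate-then-backpropagate rhythm of TBPTT is intended to be faithful.

```python
def train_with_tbptt(batches, predict_and_loss, update_state, apply_gradients,
                     batches_per_tbptt=20):
    """Accumulate losses over `batches_per_tbptt` incremental batches, then
    backpropagate once through the truncated window (TBPTT sketch)."""
    accumulated = []
    for i, batch in enumerate(batches, start=1):
        accumulated.append(predict_and_loss(batch))  # predict before observing
        update_state(batch)                          # absorb the new interactions
        if i % batches_per_tbptt == 0:
            apply_gradients(sum(accumulated))        # one joint multi-target update
            accumulated.clear()
```

With `batches_per_tbptt=20`, every 20 incremental single-target predictions are stacked into one multi-target optimization step, matching the setting found to work well on Garden and Video.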

CONCLUSION
In this paper, we propose a novel recommender system, CPMR, equipped with both context-aware temporal dynamics modeling and Pseudo-Multi-Task Learning. By jointly optimizing multi-target incremental recommendations, CPMR effectively captures and fuses historical and contextual temporal dynamic states. With these designs, our approach outperforms state-of-the-art models by 30.98% on MRR and 27.39% on Recall@10 on average, as demonstrated by extensive experimental studies.

Figure 1: A simple illustration of our proposed Pseudo-Multi-Task Learning paradigm over the incremental recommendation task.

Figure 2: Illustration of the history graph, context graph, and instant graph.

Definition 3.1 (Instant Graph). Given an interaction graph G = (U ∪ I, E), the instant graph at timestamp t_i is formed by all the interactions that happen right at t_i, i.e., G_{t_i} = (U ∪ I, E_{t_i}).
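The grouping of concurrent interactions into instant graphs can be sketched as follows; the function name and the (user, item, timestamp) tuple layout are our own illustration.

```python
from collections import defaultdict

def build_instant_graphs(edges):
    """Group (user, item, timestamp) edges into instant graphs: one edge set
    per distinct timestamp, returned in chronological order."""
    by_time = defaultdict(list)
    for u, i, t in edges:
        by_time[t].append((u, i))
    return [(t, by_time[t]) for t in sorted(by_time)]
```

Each returned pair (t_i, E_{t_i}) corresponds to one instant graph G_{t_i}, so concurrent interactions are processed together rather than as an arbitrary sequence.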

Figure 3: The proposed structure of the Context-Aware Pseudo-Multi-Task Recommender System. In subfigure A, the shared-bottom network consists of all blue blocks, and each green or pink block represents a pseudo tower or a real tower, respectively.

Figure 4: Workflow of CPMR. Modules in the same grey block are executed within a short time span.

Figure 5: Results on Garden, Video, Game, and ML-100K w.r.t. different context window lengths (in days).
• Fusion Module: The mutual update of temporal states is performed in this module. It takes the latest historical and contextual temporal states (evolved up to just before the current timestamp) as inputs and, via information fusion, generates the final user and item representations for recommendation at the current timestamp, together with the updated historical and contextual temporal states. Unlike in the evolution module, the two instantiations in this module are tailored for users and items respectively. They blend the information from both historical and contextual sources and inject it into the temporal states of each type of node, i.e., either user or item nodes.
• Predict Module: The top-N recommendation list is generated by this module. It takes as inputs the final user and item representations generated by the fusion module at the current timestamp.
• Update Module: The update module represents the newly arrived concurrent interactions as an instant graph and discretely updates the relevant temporal states from the current timestamp onward. As in the evolution module, this module has two instantiations, for historical and contextual dynamics respectively.
Right before some interactions are made at a timestamp, CPMR first calls the fusion module to update the evolved temporal states via information fusion, and then calls the predict module to generate top-N recommendations for each user involved in the coming interactions. (3) At that timestamp, these interactions are made and the instant graph is constructed. If CPMR has accumulated losses for a given number of recurrences based on the predictions in Step (2) and the instant graphs, it performs TBPTT to update the learnable parameters. Besides that, CPMR also runs the update module to add the effects of this set of interactions into the temporal states. After this, CPMR moves to the next recurrence.
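The per-timestamp recurrence above can be summarized as a small driver loop. The four callbacks stand in for the fusion, predict, loss, and update modules; their names and signatures are our own sketch of the described workflow, not CPMR's code.

```python
def run_recurrence(instant_graphs, fuse, predict, observe_loss, update):
    """One pass over chronological instant graphs, mirroring the workflow:
    fuse states, recommend, observe the realized interactions, update states."""
    losses = []
    for t, edges in instant_graphs:
        state = fuse(t)                  # fusion module: states fused at t
        ranking = predict(state, edges)  # predict module: top-N for involved users
        losses.append(observe_loss(ranking, edges))  # score against realized edges
        update(t, edges)                 # update module: absorb E_t into states
    return losses
```

TBPTT then operates over the collected losses of several consecutive recurrences, as described in the training procedure.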

Table 1: Statistics of Sequential Recommendation Datasets.

Table 3: Incremental recommendation performance. R@10 is short for Recall@10. The results of all baselines in the first section are imported from CoPE [45]. The results of CoPE (ours) in the second section are derived from our rewritten code. The best results and the runner-up among the baselines and CPMR are highlighted in bold and underlined, respectively. The % Gains are calculated by comparing the best-performing baseline with CPMR. Statistical significance of pairwise differences of CPMR vs. the best baseline is determined by a paired t-test (***, ** for p-value ≤ 0.01, 0.05 respectively).